From: Ryan Roberts
To: "Huang, Ying"
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
    Yang Shi, Michal Hocko, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting
Date: Wed, 11 Oct 2023 08:42:52 +0100
Message-ID: <45130848-2a12-4e9f-b325-e7da87ff3151@arm.com>
In-Reply-To: <87zg0pfyux.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20231010142111.3997780-1-ryan.roberts@arm.com>
 <87zg0pfyux.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts writes:
>
>> Hi All,
>>
>> This is an RFC for a small series to add support for swapping out small-sized
>> THP without needing to first split the large folio via __split_huge_page(). It
>> closely follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [1].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). However, I have a few questions on
>> whether we should consider relaxing those requirements in certain circumstances:
>>
>>
>> 1) block-backed vs file-backed
>> ==============================
>>
>> The code only attempts to allocate a contiguous set of entries if swap is backed
>> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
>> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
>> stated "It's hard to write a whole transparent huge page (THP) to a file backed
>> swap device", but didn't state why. Does this imply there is a size limit at
>> which it becomes hard? And does that therefore imply that for "small enough"
>> sizes we should now allow use with file-backed swap?
>>
>> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
>> swap: fix allocating cluster for swapfile by mistake"), which said the original
>> commit was using the wrong flag to determine if it was a block device and
>> therefore in some cases was actually doing large allocations for a file-backed
>> swap device, and this was causing file-system corruption. But that implies some
>> sort of correctness issue to me, rather than the performance issue I inferred
>> from the original commit.
>>
>> If anyone can offer an explanation, that would be helpful in determining if we
>> should allow some large sizes for file-backed swap.
>
> swap uses 'swap extent' (swap_info_struct.swap_extent_root) to map from
> swap offset to storage block number. For block-backed swap, the mapping
> is purely linear. So, you can use arbitrarily large page size. But for
> file-backed swap, only PAGE_SIZE alignment is guaranteed.

Ahh, I see, so it's a correctness issue then. Thanks!

>
>> 2) rotating vs non-rotating
>> ===========================
>>
>> I notice that the clustered approach is only used for non-rotating swap. That
>> implies that for rotating media, we will always fail a large allocation, and
>> fall back to splitting THPs to single pages. Which implies that the regression
>> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
>> is so slow that the cost of writing the data out dominates the cost of
>> splitting?
>>
>> I considered that potentially the free swap entry search algorithm that is used
>> in this case could be modified to look for (small) contiguous runs of entries;
>> up to ~16 pages (order-4) could be done by doing 2x 64bit reads from map instead
>> of single byte.
>>
>> I haven't looked into this idea in detail, but wonder if anybody thinks it is
>> worth the effort? Or perhaps it would end up causing bad fragmentation.
>
> I doubt anybody will use rotating storage to back swap now.

I'm often using a QEMU VM for testing with an Ubuntu install. The disk
enumerates as rotating storage and the swap device is file-backed. But I guess
the former issue, at least, is me setting up QEMU with incorrect options.

>
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
>
> I have used the swap sub-test case of vm-scalability to test.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

Great - I shall take a look!

>
> --
> Best Regards,
> Huang, Ying
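
For reference, the "2x 64bit reads" idea from point 2 could be sketched roughly
as follows. This is a userspace illustration, not kernel code. It assumes a map
with one byte per swap entry where a zero byte means the entry is free, and it
only considers naturally aligned runs of 16 entries (order-4). run_is_free()
and find_free_run() are hypothetical names, not existing kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RUN_LEN 16 /* order-4: 16 one-byte map entries */

/*
 * Return 1 if the 16 entries starting at map[off] are all zero (assumed to
 * mean "free"), using two 64-bit loads instead of 16 single-byte reads.
 */
static int run_is_free(const unsigned char *map, size_t off)
{
	uint64_t lo, hi;

	/* memcpy sidesteps alignment and aliasing concerns for the wide loads. */
	memcpy(&lo, map + off, sizeof(lo));
	memcpy(&hi, map + off + sizeof(lo), sizeof(hi));

	return (lo | hi) == 0;
}

/*
 * Scan the map in RUN_LEN-sized, naturally aligned steps and return the
 * offset of the first fully free run, or -1 if there is none.
 */
static long find_free_run(const unsigned char *map, size_t nr_entries)
{
	size_t off;

	assert(map != NULL);

	for (off = 0; off + RUN_LEN <= nr_entries; off += RUN_LEN) {
		if (run_is_free(map, off))
			return (long)off;
	}

	return -1;
}
```

Restricting the search to aligned runs keeps the check to two loads per
candidate, at the cost of missing free runs that straddle a 16-entry boundary.
Whether that trade-off is acceptable with respect to fragmentation is exactly
the open question raised above.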