From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id F2916C54798 for ; Tue, 27 Feb 2024 17:10:13 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7956F6B015F; Tue, 27 Feb 2024 12:10:13 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7414F6B0163; Tue, 27 Feb 2024 12:10:13 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5EC226B0164; Tue, 27 Feb 2024 12:10:13 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 4B4EA6B015F for ; Tue, 27 Feb 2024 12:10:13 -0500 (EST) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id D82981C0F52 for ; Tue, 27 Feb 2024 17:10:12 +0000 (UTC) X-FDA: 81838221864.11.CC4B9B9 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf14.hostedemail.com (Postfix) with ESMTP id B75CF100024 for ; Tue, 27 Feb 2024 17:10:10 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf14.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709053811; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ov50NRiPC9lorjCGR8+KnIqWgPnvQBQ+gJZoQniMfho=; b=KtODBtHtuaqWkqC6Qj8RLxlFJskQuwFFvuraeQm0hzvQSbQkwpa0rncXdPDAYFimJ9374G 8qDsTMQ/rRhsxRW+ErqzdR5OMvhGtIWqo+vYtdfBRRtZY1OzyrNrYk7ZVVNrqGlkz70bpr qKf97HuLf5p9+a3Asjyqc96qZtadQUg= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf14.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709053811; a=rsa-sha256; cv=none; b=E9h2v5/lyUeRuGD5E3mN1ZvDR8lctrm9EbIHhdOUc+UGunpEY/YmG/QGsg1n/Doziyt+TS LS3FH2LTQ7OxnfTQMJ9YRPBrsLJ5cD81nYn6Or8wJI1+7KBdmsBlYObjEgr4yLa4zyNyvA jS0BGuFCiugAjvUiXZ9cg4t76/cH/s0= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id DF400DA7; Tue, 27 Feb 2024 09:10:47 -0800 (PST) Received: from [10.1.30.188] (XHFQ2J9959.cambridge.arm.com [10.1.30.188]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id CB6C93F73F; Tue, 27 Feb 2024 09:10:07 -0800 (PST) Message-ID: Date: Tue, 27 Feb 2024 17:10:06 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Content-Language: en-GB From: Ryan Roberts To: David Hildenbrand , Andrew Morton , Matthew Wilcox , Huang Ying , Gao Xiang , Yu Zhao , Yang Shi , Michal Hocko , Kefeng Wang Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20231025144546.577640-1-ryan.roberts@arm.com> <20231025144546.577640-2-ryan.roberts@arm.com> <6541e29b-f25a-48b8-a553-fd8febe85e5a@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: B75CF100024 X-Rspam-User: X-Rspamd-Server: rspam02 X-Stat-Signature: 4nzbcfowy5o3mer686iacex9uu96eshp X-HE-Tag: 1709053810-378404 X-HE-Meta: U2FsdGVkX19RT8xX+EMTkaCNBmHNOmT3+urI43cuH3EUYWGcbqOeUmDapEE1cJdoBrTdC89yXCSXJSxvN7ij+srVk5a7SHZrU/beFesZwFHF+F2lkjgl1+v5RgmCWDIzDlLTfXAn+oZpar0QmPREMZqIA9w+X8xMFelnztElH5F3U1wL2YfrjG1xypJ+019KImvLivnZS92y1YZioZS/3BoS4qG9ac6Iptam8mKekfNJRfc0semeaB/I1XSNIZIRHIryOZH2XDCkhkCgGiUPnwiy7FoJy+jraCPOlWyX2Jakcs+kUbZD+eJpS9acUJ7esc8xEPHZ3N5XbP4yNZj+pPv+WEb5/pniz55vkdRNX1AtIWt/KhJYJQFydPWJL7LAo6+TLjET/NZw1Ukjz3GpBrCK2c9w/dGI6LTjyTDPL6WP0bbWH0l8WIlK6tyi3qlMmm86DINuELMh9aL8+mHulPlaXd+H9DlHocDowRftkx1400TFunlUOAUYVLDw6fLVmFEPfCXJ/h74IjG5kJSkQAp+na3/5w5INxwH0KA6xsW/ukPU14VKBXQD/RlIRUIoXjj/IZuX5Zk7+d3k408wAOOl4BrCkOX1/v8TPvPRrn0c9j5ld8RKcKKhaT31t0fwUlL35skUfylVrBU71MPl27kh9ixjS0TAAsB1hMCyw9Ynwy6WrigSbTNjSSqODC/o+sWXmSirDz2/ZW3CnKZWvv4ZhEgLSbSRnoo3otgg0M91dRGdYUti12dxglHjYzcs9zyE7jEs9ftCpagNiWsZGaLpLzCOuK90E/dGhiXvdWFT2jAXhILwH62ZfIrpd4pzQHx4iOYjfZaRC8QHlj2MJvfJyFqJPvpSJqKxzHNPpn3s6OkUgGdksfNMcCRmV04sTX6dJL7uIeg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Hi David, On 26/02/2024 17:41, Ryan Roberts wrote: > On 22/02/2024 10:20, David Hildenbrand wrote: >> On 22.02.24 11:19, David Hildenbrand wrote: >>> On 25.10.23 16:45, Ryan Roberts wrote: >>>> As preparation for supporting small-sized THP in the swap-out path, >>>> without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE, >>>> which, when present, always implies PMD-sized THP, which is the same as >>>> the cluster size. >>>> >>>> The only use of the flag was to determine whether a swap entry refers to >>>> a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). >>>> Instead of relying on the flag, we now pass in nr_pages, which >>>> originates from the folio's number of pages. This allows the logic to >>>> work for folios of any order. >>>> >>>> The one snag is that one of the swap_page_trans_huge_swapped() call >>>> sites does not have the folio. But it was only being called there to >>>> avoid bothering to call __try_to_reclaim_swap() in some cases. >>>> __try_to_reclaim_swap() gets the folio and (via some other functions) >>>> calls swap_page_trans_huge_swapped(). So I've removed the problematic >>>> call site and believe the new logic should be equivalent. >>> >>> That is theĀ  __try_to_reclaim_swap() -> folio_free_swap() -> >>> folio_swapped() -> swap_page_trans_huge_swapped() call chain I assume. >>> >>> The "difference" is that you will now (1) get another temporary >>> reference on the folio and (2) (try)lock the folio every time you >>> discard a single PTE of a (possibly) large THP. >>> >> >> Thinking about it, your change will not only affect THP, but any call to >> free_swap_and_cache(). >> >> Likely that's not what we want. :/ >> > > Is folio_trylock() really that expensive given the code path is already locking > multiple spinlocks, and I don't think we would expect the folio lock to be very > contended? > > I guess filemap_get_folio() could be a bit more expensive, but again, is this > really a deal-breaker? > > > I'm just trying to refamiliarize myself with this series, but I think I ended up > allocating a cluster per cpu per order. So one potential solution would be to > turn the flag into a size and store it in the cluster info. (In fact I think I > was doing that in an early version of this series - will have to look at why I > got rid of that). Then we could avoid needing to figure out nr_pages from the folio. I ran some microbenchmarks to see if these extra operations cause a performance issue - it all looks OK to me. I modified your "pte-mapped-folio-benchmarks" to add a "munmap-swapped-forked" mode, which prepares the 1G memory mapping by first paging it out with MADV_PAGEOUT, then it forks a child (and keeps that child alive) so that the swap slots have 2 references, then it measures the duration of munmap() in the parent on the entire range. The idea is that free_swap_and_cache() is called for each PTE during munmap(). Prior to my change, swap_page_trans_huge_swapped() will return true, due to the child's references, and __try_to_reclaim_swap() is not called. After my change, we no longer have this short cut. In both cases the results are within 1% (confirmed across multiple runs of 20 seconds each): mm-stable: Average: 0.004997 + change: Average: 0.005037 (these numbers are for Ampere Altra. I also tested on M2 VM - no regression their either). Do you still have a concern about this change? An alternative is to store the folio size in the cluster, but that won't be accurate if the folio is later split or if an entry within the cluster is later stolen for an order-0 entry. I think it would still work though; it just means that you might get a false positive in those circumstances, which means taking the "slow" path. But this is a rare event. Regardless, I prefer not to do this, since it adds complexity and doesn't benefit performance. Thanks, Ryan