Subject: Re: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
Date: Mon, 4 Mar 2024 16:03:53 +0000
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand, Andrew Morton, Matthew Wilcox, Huang Ying, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org

On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author)
>> commented "I believe this should be OK. Better to compare the performance
>> too." at [1]. That implies to me that perhaps the optimization wasn't in
>> response to a specific problem after all. Do you have any thoughts, Huang?
>
> Might make sense to include that in the patch description!
>
>> OK, so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap
>> slot from the swap entry. Today swap supports 2 sizes: PAGE_SIZE and
>> PMD_SIZE. For PMD_SIZE, it always uses a full cluster, so we can easily
>> add a flag to the cluster to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all power-of-2 sizes. Most of the time,
>> a cluster will contain only one size of THP, but this is not the case when
>> a THP in the swapcache gets split or when an order-0 slot gets stolen. We
>> expect these cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of
>> the time it will be the full size of the swap entry, but sometimes it will
>> cover only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped(), meaning we take the slow path, but that is
>> rare. There is one wrinkle: currently the HUGE flag is cleared in
>> put_swap_folio(). We wouldn't want to do the equivalent in the new scheme
>> (i.e. set the whole cluster to order-0). I think that is safe, but I
>> haven't completely convinced myself yet.
>>
>> 2) Allocate 4 bits per (small) swap slot to hold the order. This gives
>> precise information and is conceptually simpler to understand, but will
>> cost more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this altogether if we can (and would like to hear
>> Huang's thoughts). But if it's a choice between 1 and 2, I prefer 1 - I'll
>> do some prototyping.
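
To make (1) a little more concrete, the rough shape I have in mind is below.
This is only a sketch: the field and helper names are invented, the real
struct swap_cluster_info packs its existing members into bitfields, and the
initialisation when a cluster first becomes non-empty is elided.

struct swap_cluster_info {
	/* ... existing data/flags bitfields elided ... */
	unsigned int min_order:4;	/* hypothetical: smallest order in cluster */
};

/*
 * Conservative size (in pages) of any swap entry in this cluster. A THP
 * split or a stolen order-0 slot lowers min_order, so a check like
 * swap_page_trans_huge_swapped() can only get a false negative (slow path),
 * never a false positive.
 */
static inline unsigned int cluster_min_nr_pages(struct swap_cluster_info *ci)
{
	return 1U << ci->min_order;
}

/* Called when allocating into the cluster or when a swapcache folio splits. */
static inline void cluster_update_min_order(struct swap_cluster_info *ci,
					    unsigned int order)
{
	if (order < ci->min_order)
		ci->min_order = order;
}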
>
> Taking a step back: what about we simply batch unmapping of swap entries?
>
> That is, if we're unmapping a PTE range, we'll collect swap entries (under
> PT lock) that reference consecutive swap offsets in the same swap file.
>
> There, we can then first decrement all the swap counts, and then try
> minimizing how often we actually have to try reclaiming swap space (lookup
> folio, see it's a large folio that we cannot reclaim or could reclaim, ...).
>
> Might need some fine-tuning in swap code to "advance" to the next entry to
> try freeing up, but we certainly can do better than what we would do right
> now.

Hi,

I'm struggling to convince myself that free_swap_and_cache() can't race with
swapoff(). Can anyone explain why this is safe?

I *think* they are both serialized by the PTL, since all callers of
free_swap_and_cache() (except shmem) hold the PTL, and swapoff() calls
try_to_unuse() early on, which takes the PTL as it iterates over every vma in
every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
but I can't see the exact serialization mechanism.

I've implemented a batching function, as David suggested above, but I'm trying
to convince myself that it is safe for it to access si->swap_map[] without a
lock (i.e. that swapoff() can't concurrently free it). But I think
free_swap_and_cache() as it already exists depends on being able to access the
si without an explicit lock, so I'm assuming the same mechanism will protect
my new changes. But I want to be sure I understand the mechanism...

This is the existing free_swap_and_cache(). I think _swap_info_get() would
break if this could race with swapoff(), and __swap_entry_free() looks up the
cluster from an array, which would also be freed by swapoff() if racing:

int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	p = _swap_info_get(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
	}
	return p != NULL;
}
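
(One thought: if the implicit PTL serialization turns out to be too subtle to
rely on, I could make the lifetime explicit and pin the device across the
batch, roughly as sketched below. As I understand it, get_swap_device()
validates the entry and holds off swapoff() until the matching
put_swap_device(). The function below doesn't do this yet; it relies on the
callers holding the PTL.)

	/*
	 * Sketch only, not part of the function below: take a reference on
	 * the device so swapoff() cannot free si->swap_map[] while we walk
	 * the batch.
	 */
	si = get_swap_device(entry);
	if (!si)
		return;

	/* ... __swap_entry_free() / __try_to_reclaim_swap() over the range ... */

	put_swap_device(si);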
This is my new function. I want to be sure that it's safe to do the
READ_ONCE(si->swap_map[...]):

void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
	unsigned long end = swp_offset(entry) + nr;
	unsigned int type = swp_type(entry);
	struct swap_info_struct *si;
	unsigned long offset;

	if (non_swap_entry(entry))
		return;

	si = _swap_info_get(entry);
	if (!si || end > si->max)
		return;

	/*
	 * First free all entries in the range.
	 */
	for (offset = swp_offset(entry); offset < end; offset++) {
		VM_WARN_ON(data_race(!si->swap_map[offset]));
		__swap_entry_free(si, swp_entry(type, offset));
	}

	/*
	 * Now go back over the range trying to reclaim the swap cache. This
	 * is more efficient for large folios because we will only try to
	 * reclaim the swap once per folio in the common case. If we do
	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop,
	 * the latter will get a reference and lock the folio for every
	 * individual page but will only succeed once the swap slot for every
	 * subpage is zero.
	 */
	for (offset = swp_offset(entry); offset < end; offset += nr) {
		nr = 1;
		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {    << HERE
			/*
			 * Folios are always naturally aligned in swap so
			 * advance forward to the next boundary. Zero means no
			 * folio was found for the swap entry, so advance by 1
			 * in this case. Negative value means folio was found
			 * but could not be reclaimed. Here we can still
			 * advance to the next boundary.
			 */
			nr = __try_to_reclaim_swap(si, offset,
						   TTRS_UNMAPPED | TTRS_FULL);
			if (nr == 0)
				nr = 1;
			else if (nr < 0)
				nr = -nr;
			nr = ALIGN(offset + 1, nr) - offset;
		}
	}
}

Thanks,
Ryan
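
P.S. For context, the caller side I have in mind looks roughly like the
fragment below. It's only a sketch: swap_pte_batch() is a hypothetical helper
standing in for "count how many consecutive PTEs, under the PT lock, hold
swap entries with consecutive offsets in the same swap file", per David's
suggestion above.

	pte_t ptent = ptep_get(pte);
	swp_entry_t entry = pte_to_swp_entry(ptent);
	int nr = 1;

	if (!non_swap_entry(entry)) {
		/* Hypothetical helper, see above; runs under the PT lock. */
		nr = swap_pte_batch(pte, max_nr, ptent);
		free_swap_and_cache_nr(entry, nr);
	}
	/* ... then clear all nr PTEs and advance the scan by nr ... */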