Message-ID: <4b351eaa-c2db-408b-9ce2-4b12e3d6b30a@arm.com>
Date: Thu, 19 Oct 2023 15:25:39 +0100
Subject: Re: [PATCH v2 2/2] mm: swap: Swap-out small-sized THP without splitting
To: "Huang, Ying"
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
 Yang Shi, Michal Hocko, Kefeng Wang, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Tim Chen
References: <20231017161302.2518826-1-ryan.roberts@arm.com>
 <20231017161302.2518826-3-ryan.roberts@arm.com>
 <87r0ls773p.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87mswfuppr.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Ryan Roberts
In-Reply-To: <87mswfuppr.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 19/10/2023 06:49, Huang, Ying wrote:
> Ryan Roberts writes:
>
>> On 18/10/2023 07:55, Huang, Ying wrote:
>>> Ryan Roberts writes:
>>>
>
> [snip]
>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index a073366a227c..35cbbe6509a9 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -268,6 +268,12 @@ struct swap_cluster_info {
>>>>  struct percpu_cluster {
>>>>  	struct swap_cluster_info index; /* Current cluster index */
>>>>  	unsigned int next; /* Likely next allocation offset */
>>>> +	unsigned int large_next[];	/*
>>>> +					 * next free offset within current
>>>> +					 * allocation cluster for large folios,
>>>> +					 * or UINT_MAX if no current cluster.
>>>> +					 * Index is (order - 1).
>>>> +					 */
>>>>  };
>>>>
>>>>  struct swap_cluster_list {
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index b83ad77e04c0..625964e53c22 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -987,35 +987,70 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>>>  	return n_ret;
>>>>  }
>>>>
>>>> -static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
>>>> +static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
>>>> +			    unsigned int nr_pages)
>>>
>>> This looks hacky. IMO, we should put the allocation logic inside the
>>> percpu_cluster framework. If the percpu_cluster framework doesn't work
>>> for you, just refactor it first.
>>
>> I'm not sure I really understand what you are suggesting - could you
>> elaborate? What "framework"? I only see a per-cpu data structure and
>> scan_swap_map_try_ssd_cluster(), which is very much geared towards
>> order-0 allocations.
>
> I suggest sharing as much code as possible between the order-0 and
> order > 0 swap entry allocation paths.
> I think that we can make scan_swap_map_try_ssd_cluster() work for
> order > 0 swap entry allocation too.
>
> [...]
>
>>>> +	/*
>>>> +	 * If scan_swap_map_slots() can't find a free cluster, it will
>>>> +	 * check si->swap_map directly. To make sure this standby
>>>> +	 * cluster isn't taken by scan_swap_map_slots(), mark the swap
>>>> +	 * entries bad (occupied). (same approach as discard).
>>>> +	 */
>>>> +	memset(si->swap_map + offset + nr_pages, SWAP_MAP_BAD,
>>>> +	       SWAPFILE_CLUSTER - nr_pages);
>>>
>>> There's an issue with this solution. If the free space of the swap
>>> device runs low, it's possible that
>>>
>>> - some clusters are put in the percpu_cluster of some CPUs, and
>>>   the swap entries there are marked as used
>>>
>>> - there are no free swap entries elsewhere
>>>
>>> - nr_swap_pages isn't 0
>>>
>>> So we will still scan the LRU, but swap allocation fails even though
>>> there is still free swap space.

I'd like to decide how best to solve this problem before I can figure out how
much code sharing is possible between the order-0 and order > 0 allocators.
I see a couple of potential options (a rough standalone sketch of each
follows at the end of this mail):

1) Adjust nr_swap_pages to account for the entries that are marked
   SWAP_MAP_BAD: when reserving a new per-order, per-cpu cluster, subtract
   SWAPFILE_CLUSTER from nr_swap_pages, then add nr_pages back for each
   allocation made from that cluster.

2) Don't mark the entries in the reserved cluster as SWAP_MAP_BAD, which
   would allow the scanner to steal (order-0) entries from it. The scanner
   would set a flag in the cluster info to mark the cluster as having been
   allocated from by the scanner; the next attempt to allocate a high order
   from it would then discard it as the cpu's current cluster and take a
   fresh cluster from the free list.

While option 2 is a bit more complex, I prefer it as a solution. What do you
think?
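
To make option 1 a little more concrete, here is a minimal standalone sketch
(plain C, not kernel code) of the accounting I have in mind. The names
nr_swap_pages_sim, reserve_cluster() and alloc_from_cluster() are invented for
illustration; in the kernel the debit side would happen in the normal
allocation accounting path:

/*
 * Standalone model of option 1: charge a reserved per-order/per-cpu cluster
 * against the global free count up front, then credit back nr_pages for each
 * allocation taken from it, so the generic accounting (which debits nr_pages
 * again) leaves the count correct.  All names are invented for illustration.
 */
#include <stdatomic.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512L

static atomic_long nr_swap_pages_sim = 4096;	/* free entries visible to reclaim */

/* Reserve a whole cluster for high-order allocations on this CPU. */
static void reserve_cluster(void)
{
	atomic_fetch_sub(&nr_swap_pages_sim, SWAPFILE_CLUSTER);
}

/* Allocate nr_pages entries from the reserved cluster. */
static void alloc_from_cluster(long nr_pages)
{
	/* Credit back what was pre-charged at reservation time... */
	atomic_fetch_add(&nr_swap_pages_sim, nr_pages);
	/* ...then the usual accounting debits nr_pages as for any allocation. */
	atomic_fetch_sub(&nr_swap_pages_sim, nr_pages);
}

int main(void)
{
	reserve_cluster();
	alloc_from_cluster(16);
	alloc_from_cluster(16);
	/* Reserved-but-unused entries never look free: 4096 - 512 = 3584. */
	printf("visible free entries: %ld\n", atomic_load(&nr_swap_pages_sim));
	return 0;
}

The point of the model is that the entries sitting idle in a per-cpu cluster
are never counted as free, so the "nr_swap_pages isn't 0 but nothing is
allocatable" situation can't arise.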
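
And a similarly hedged sketch of option 2, modelling the "scanner stole from
this cluster" flag. struct cluster_sim, scanner_steal() and alloc_high_order()
are all made-up names; the real logic would presumably sit in
scan_swap_map_try_ssd_cluster() / swap_alloc_large():

/*
 * Standalone model of option 2: a reserved per-CPU cluster whose free
 * entries may be stolen by the order-0 scanner.  All names are invented
 * for illustration; this is not the kernel implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512U

struct cluster_sim {
	unsigned int base;	/* first swap offset of the cluster */
	unsigned int next;	/* next free offset within the cluster */
	bool stolen;		/* set by the order-0 scanner on steal */
};

/* Pretend to take a fresh cluster off the free list. */
static bool take_free_cluster(struct cluster_sim *c, unsigned int base)
{
	c->base = base;
	c->next = 0;
	c->stolen = false;
	return true;
}

/* Order-0 scanner steals one entry from the reserved cluster. */
static unsigned int scanner_steal(struct cluster_sim *c)
{
	c->stolen = true;
	return c->base + c->next++;
}

/*
 * High-order allocation: if the scanner has touched this cluster, its
 * remaining space may be fragmented, so discard it and reserve a new one.
 */
static int alloc_high_order(struct cluster_sim *c, unsigned int nr_pages,
			    unsigned int *offset)
{
	if (c->stolen || c->next + nr_pages > SWAPFILE_CLUSTER) {
		if (!take_free_cluster(c, /* hypothetical new base */ 4096))
			return -1;	/* no free cluster: caller falls back */
	}
	*offset = c->base + c->next;
	c->next += nr_pages;
	return 0;
}

int main(void)
{
	struct cluster_sim c;
	unsigned int off;

	take_free_cluster(&c, 0);
	alloc_high_order(&c, 16, &off);	/* allocates at offset 0 */
	scanner_steal(&c);		/* order-0 scan steals entry 16 */
	alloc_high_order(&c, 16, &off);	/* cluster discarded; off = 4096 */
	printf("second allocation at offset %u\n", off);
	return 0;
}

The trade-off is that a stolen-from cluster is abandoned early for high-order
use, but nothing is ever hidden from the scanner, so low-space behaviour stays
unchanged.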