From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Tue, 25 Feb 2025 18:15:47 +0800
Subject: Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
To: Zi Yan
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Matthew Wilcox,
 Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel@vger.kernel.org,
 Andrew Morton
References: <20250218235444.1543173-1-ziy@nvidia.com>
 <20250218235444.1543173-3-ziy@nvidia.com>
 <47d189c7-3143-4b59-a3af-477d4c46a8a0@linux.alibaba.com>
 <2e4b9927-562d-4cfa-9362-f23e3bcfc454@linux.alibaba.com>
 <42440332-96FF-4ECB-8553-9472125EB33F@nvidia.com>
 <37C4B6AC-0757-4B92-88F3-75F1B4DEFFC5@nvidia.com>
 <655589D4-7E13-4F5B-8968-3FCB71DCE0FC@nvidia.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2025/2/25 17:20, Baolin Wang wrote:
>
>
> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things :)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering
>>>>>>>>> n slots and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the
>>>>>>>>> folio need to be split and the remaining n-1 shadow entries can
>>>>>>>>> be retained with orders ranging from 0 to n-1. This method only
>>>>>>>>> requires (n/XA_CHUNK_SHIFT) new xa_nodes instead of
>>>>>>>>> (n % XA_CHUNK_SHIFT) * (n/XA_CHUNK_SHIFT) new xa_nodes, compared
>>>>>>>>> to the original xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming
>>>>>>>>> XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of 8.
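(For concreteness, the arithmetic behind the order-9 example above -- an
illustration, not taken from the patch description:

    order-9 entry                   -> 2^9 = 512 slots
    XA_CHUNK_SHIFT = 6              -> one xa_node covers 2^6 = 64 slots,
                                       so the entry spans 512 / 64 = 8 chunks
    xas_split_alloc() + xas_split() -> every chunk is split down to order-0,
                                       i.e. 8 new leaf xa_nodes
    xas_try_split()                 -> only the chunk containing the target
                                       index is split down to order-0,
                                       i.e. 1 new leaf xa_node)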
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls
>>>>>>>>> to xas_try_split() during split.
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by
>>>>>>>> skipping the swap cache, we will split the large swap entry stored
>>>>>>>> in the shmem mapping into order-0 swap entries, rather than
>>>>>>>> splitting it into other orders of swap entries. This is because the
>>>>>>>> next time we swap in a shmem folio through shmem_swapin_cluster(),
>>>>>>>> it will still be an order-0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right?
>>>>>>> shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swap in an order-0 folio from the async swap
>>>>>> device at a time. However, for a sync swap device, we will skip the
>>>>>> swapcache and swap in the whole large folio (see commit
>>>>>> 1dd44c0af4fa), so it will not call shmem_split_large_entry() in this
>>>>>> case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store the
>>>>>>> order-0 folio. For example, with an order-9 large swap entry, to
>>>>>>> swap in the first order-0 folio, the large swap entry will become
>>>>>>> order-0, order-0, order-1, order-2, … order-8 after the split.
>>>>>>> Then the first order-0 swap entry can be used. Then, when a second
>>>>>>> order-0 is swapped in, the second order-0 can be used. When the
>>>>>>> last order-0 is swapped in, the order-8 would be split to order-7,
>>>>>>> order-6, …, order-1, order-0, order-0, and the last order-0 will
>>>>>>> be used.
>>>>>>
>>>>>> Yes, understood. However, for sequential swapin scenarios, originally
>>>>>> only one split operation is needed, whereas your approach increases
>>>>>> the number of split operations. Of course, I understand that in
>>>>>> non-sequential swapin scenarios, your patch will save some xarray
>>>>>> memory. It might be necessary to evaluate whether the increased split
>>>>>> operations will have a significant impact on the performance of
>>>>>> sequential swapin.
>>>>
>>>> Is there a shmem swapin test I can run to measure this? xas_try_split()
>>>> should perform operations similar to the existing xas_split_alloc() +
>>>> xas_split().
>>>>
>>>>>>
>>>>>>> Maybe the swapin assumes that after shmem_split_large_entry(), all
>>>>>>> swap entries are order-0, which can lead to issues. There should be
>>>>>>> some check like: if the swap entry order > folio_order,
>>>>>>> shmem_split_large_entry() should be used.
>>>>>>>>
>>>>>>>> Moreover, I did a quick test with swapping in order-6 shmem folios;
>>>>>>>> however, my test hung, and the console was continuously filled with
>>>>>>>> the following information. It seems there are some issues with
>>>>>>>> shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not
>>>>>>> allocate any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to
>>>>>>> see the OOM error below. Let me know if there is anything I can
>>>>>>> help with.
>>>>>> I encountered some issues while testing order-4 and order-6 swapin
>>>>>> with your patches. I also roughly reviewed the patch, and it seems
>>>>>> that the new swap entry stored in the shmem mapping was not correctly
>>>>>> updated after the split.
>>>>>>
>>>>>> The following logic is to reset the swap entry after the split, and I
>>>>>> assume that the large swap entry is always split to order 0 before.
>>>>>> As your patch suggests, if a non-uniform split is used, then the
>>>>>> logic for resetting the swap entry needs to be changed? Please
>>>>>> correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>    * offset of the original large entry must be continuous.
>>>>>>    */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>       swp_entry_t tmp;
>>>>>>
>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>> }
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, shmem_split_large_entry() seems to
>>>>> always return 0 even though it splits a large swap entry, but we still
>>>>> need to re-calculate the swap entry value after splitting; otherwise
>>>>> it may return errors due to shmem_confirm_swap() validation failure.
>>>>>
>>>>> /*
>>>>>   * If the large swap entry has already been split, it is
>>>>>   * necessary to recalculate the new swap entry based on
>>>>>   * the old order alignment.
>>>>>   */
>>>>>   if (split_order > 0) {
>>>>>     pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>>     swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new
>>>> version properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It would be great if you can share a swapin
>> test, so that I can test locally. Thanks.
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b35ba250c53d..bfc4ef511391 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>   {
>>       struct address_space *mapping = inode->i_mapping;
>>       XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>> -    int split_order = 0;
>> +    int split_order = 0, entry_order = 0;
>>       int i;
>>
>>       /* Convert user data gfp flags to xarray node gfp flags */
>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>           }
>>
>>           order = xas_get_order(&xas);
>> +        entry_order = order;
>>
>>           /* Try to split large swap entry in pagecache */
>>           if (order > 0) {
>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>                   xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>                   if (xas_error(&xas))
>>                       goto unlock;
>> +
>> +                /*
>> +                 * Re-set the swap entry after splitting, and the swap
>> +                 * offset of the original large entry must be continuous.
>> +                 */
>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>> +                    swp_entry_t tmp;
>> +
>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>> +                           swp_to_radix_entry(tmp), 0);
>> +                }
>>                   cur_order = split_order;
>>                   split_order =
>>                       xas_try_split_min_order(split_order);
>>               }
>
> This looks incorrect to me. Suppose we are splitting an order-9 swap
> entry: in the first iteration of the loop, it splits the order-9 swap
> entry into 8 order-6 swap entries. At this point, the order-6 swap
> entries are reset, and everything seems fine.
>
> However, in the second iteration, where an order-6 swap entry is split
> into 63 order-0 swap entries, the split operation itself is correct. But

typo: should be 64

> when resetting the order-0 swap entry, it seems incorrect. Now
> 'cur_order' = 6 and 'split_order' = 0, which means the range for the
> reset index is always between 0 and 63 (see __xa_store()).

Sorry for the confusion. The 'aligned_index' will be rounded down by
'cur_order' (which is 6), so the index is correct. But the swap offset
calculated by 'swp_offset(swap) + i' looks incorrect, because 'i' is
always between 0 and 63.

> > +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
> > +    pgoff_t aligned_index = round_down(index, 1 << cur_order);
> > +    swp_entry_t tmp;
> > +
> > +    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> > +    __xa_store(&mapping->i_pages, aligned_index + i,
> > +        swp_to_radix_entry(tmp), 0);
> > +}
>
> However, if the index is greater than 63, it appears that it is not set
> to the correct new swap entry after splitting. Please correct me if I
> missed anything.
>
>> -
>> -            /*
>> -             * Re-set the swap entry after splitting, and the swap
>> -             * offset of the original large entry must be continuous.
>> -             */
>> -            for (i = 0; i < 1 << order; i++) {
>> -                pgoff_t aligned_index = round_down(index, 1 << order);
>> -                swp_entry_t tmp;
>> -
>> -                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> -                __xa_store(&mapping->i_pages, aligned_index + i,
>> -                       swp_to_radix_entry(tmp), 0);
>> -            }
>>           }
>>
>>   unlock:
>> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>       if (xas_error(&xas))
>>           return xas_error(&xas);
>>
>> -    return split_order;
>> +    return entry_order;
>>   }
>>
>>   /*
>>
>>
>> Best Regards,
>> Yan, Zi
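To make the offset point concrete: below is a rough, untested sketch (not
the final fix) of how the reset loop could compute the swap offset relative
to the start of the original large entry, reusing 'entry_order' from the
diff above, so that slots beyond the first 64 also get the right offset:

    /* start of the original large swap entry in the mapping */
    pgoff_t swap_index = round_down(index, 1 << entry_order);

    for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
        pgoff_t aligned_index = round_down(index, 1 << cur_order);
        /* distance of this slot from the start of the original entry */
        pgoff_t offset = aligned_index + i - swap_index;
        swp_entry_t tmp;

        tmp = swp_entry(swp_type(swap), swp_offset(swap) + offset);
        __xa_store(&mapping->i_pages, aligned_index + i,
                   swp_to_radix_entry(tmp), 0);
    }

In the first iteration (cur_order == entry_order) aligned_index equals
swap_index, so this behaves like the posted diff; in later iterations the
chunk's distance from the start of the large entry is added, so the slots
above 63 would also be re-set with the correct swap offset.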