From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Tue, 25 Feb 2025 18:15:47 +0800
Subject: Re: [PATCH v2 2/2] mm/shmem: use xas_try_split() in shmem_split_large_entry()
To: Zi Yan
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Matthew Wilcox,
 Hugh Dickins, Kairui Song, Miaohe Lin, linux-kernel@vger.kernel.org,
 Andrew Morton
References: <20250218235444.1543173-1-ziy@nvidia.com>
 <20250218235444.1543173-3-ziy@nvidia.com>
 <47d189c7-3143-4b59-a3af-477d4c46a8a0@linux.alibaba.com>
 <2e4b9927-562d-4cfa-9362-f23e3bcfc454@linux.alibaba.com>
 <42440332-96FF-4ECB-8553-9472125EB33F@nvidia.com>
 <37C4B6AC-0757-4B92-88F3-75F1B4DEFFC5@nvidia.com>
 <655589D4-7E13-4F5B-8968-3FCB71DCE0FC@nvidia.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2025/2/25 17:20, Baolin Wang wrote:
>
>
> On 2025/2/21 10:38, Zi Yan wrote:
>> On 20 Feb 2025, at 21:33, Zi Yan wrote:
>>
>>> On 20 Feb 2025, at 8:06, Zi Yan wrote:
>>>
>>>> On 20 Feb 2025, at 4:27, Baolin Wang wrote:
>>>>
>>>>> On 2025/2/20 17:07, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/2/20 00:10, Zi Yan wrote:
>>>>>>> On 19 Feb 2025, at 5:04, Baolin Wang wrote:
>>>>>>>
>>>>>>>> Hi Zi,
>>>>>>>>
>>>>>>>> Sorry for the late reply due to being busy with other things :)
>>>>>>>
>>>>>>> Thank you for taking a look at the patches. :)
>>>>>>>
>>>>>>>> On 2025/2/19 07:54, Zi Yan wrote:
>>>>>>>>> During shmem_split_large_entry(), large swap entries are covering
>>>>>>>>> n slots and an order-0 folio needs to be inserted.
>>>>>>>>>
>>>>>>>>> Instead of splitting all n slots, only the 1 slot covered by the
>>>>>>>>> folio need to be split and the remaining n-1 shadow entries can
>>>>>>>>> be retained with orders ranging from 0 to n-1. This method only
>>>>>>>>> requires (n/XA_CHUNK_SHIFT) new xa_nodes instead of
>>>>>>>>> (n % XA_CHUNK_SHIFT) * (n/XA_CHUNK_SHIFT) new xa_nodes, compared
>>>>>>>>> to the original xas_split_alloc() + xas_split() one.
>>>>>>>>>
>>>>>>>>> For example, to split an order-9 large swap entry (assuming
>>>>>>>>> XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of 8.
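(For concreteness, the arithmetic behind the order-9 example above -- an
illustration, not taken from the patch description:

    order-9 entry                   -> 2^9 = 512 slots
    XA_CHUNK_SHIFT = 6              -> one xa_node covers 2^6 = 64 slots,
                                       so the entry spans 512 / 64 = 8 chunks
    xas_split_alloc() + xas_split() -> every chunk is split down to order-0,
                                       i.e. 8 new leaf xa_nodes
    xas_try_split()                 -> only the chunk containing the target
                                       index is split down to order-0,
                                       i.e. 1 new leaf xa_node)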
>>>>>>>>>
>>>>>>>>> xas_try_split_min_order() is used to reduce the number of calls
>>>>>>>>> to xas_try_split() during split.
>>>>>>>>
>>>>>>>> For shmem swapin, if we cannot swap in the whole large folio by
>>>>>>>> skipping the swap cache, we will split the large swap entry stored
>>>>>>>> in the shmem mapping into order-0 swap entries, rather than
>>>>>>>> splitting it into other orders of swap entries. This is because the
>>>>>>>> next time we swap in a shmem folio through shmem_swapin_cluster(),
>>>>>>>> it will still be an order-0 folio.
>>>>>>>
>>>>>>> Right. But the swapin is one folio at a time, right?
>>>>>>> shmem_split_large_entry()
>>>>>>
>>>>>> Yes, now we always swap in an order-0 folio from the async swap
>>>>>> device at a time. However, for a sync swap device, we will skip the
>>>>>> swapcache and swap in the whole large folio (see commit
>>>>>> 1dd44c0af4fa), so it will not call shmem_split_large_entry() in this
>>>>>> case.
>>>>
>>>> Got it. I will check the commit.
>>>>
>>>>>>
>>>>>>> should split the large swap entry and give you a slot to store the
>>>>>>> order-0 folio. For example, with an order-9 large swap entry, to
>>>>>>> swap in the first order-0 folio, the large swap entry will become
>>>>>>> order-0, order-0, order-1, order-2, … order-8 after the split.
>>>>>>> Then the first order-0 swap entry can be used. Then, when a second
>>>>>>> order-0 is swapped in, the second order-0 can be used. When the
>>>>>>> last order-0 is swapped in, the order-8 would be split to order-7,
>>>>>>> order-6, …, order-1, order-0, order-0, and the last order-0 will
>>>>>>> be used.
>>>>>>
>>>>>> Yes, understood. However, for sequential swapin scenarios, originally
>>>>>> only one split operation is needed, whereas your approach increases
>>>>>> the number of split operations. Of course, I understand that in
>>>>>> non-sequential swapin scenarios, your patch will save some xarray
>>>>>> memory. It might be necessary to evaluate whether the increased split
>>>>>> operations will have a significant impact on the performance of
>>>>>> sequential swapin.
>>>>
>>>> Is there a shmem swapin test I can run to measure this? xas_try_split()
>>>> should perform operations similar to the existing xas_split_alloc() +
>>>> xas_split().
>>>>
>>>>>>
>>>>>>> Maybe the swapin assumes that after shmem_split_large_entry(), all
>>>>>>> swap entries are order-0, which can lead to issues. There should be
>>>>>>> some check like: if the swap entry order > folio_order,
>>>>>>> shmem_split_large_entry() should be used.
>>>>>>>>
>>>>>>>> Moreover, I did a quick test with swapping in order-6 shmem folios;
>>>>>>>> however, my test hung, and the console was continuously filled with
>>>>>>>> the following information. It seems there are some issues with
>>>>>>>> shmem swapin handling. Anyway, I need more time to debug and test.
>>>>>>> To swap in order-6 folios, shmem_split_large_entry() does not
>>>>>>> allocate any new xa_node, since XA_CHUNK_SHIFT is 6. It is weird to
>>>>>>> see the OOM error below. Let me know if there is anything I can
>>>>>>> help with.
>>>>>> I encountered some issues while testing order-4 and order-6 swapin
>>>>>> with your patches. I also roughly reviewed the patch, and it seems
>>>>>> that the new swap entry stored in the shmem mapping was not correctly
>>>>>> updated after the split.
>>>>>>
>>>>>> The following logic is to reset the swap entry after the split, and I
>>>>>> assume that the large swap entry is always split to order 0 before.
>>>>>> As your patch suggests, if a non-uniform split is used, then the
>>>>>> logic for resetting the swap entry needs to be changed? Please
>>>>>> correct me if I missed something.
>>>>>>
>>>>>> /*
>>>>>>    * Re-set the swap entry after splitting, and the swap
>>>>>>    * offset of the original large entry must be continuous.
>>>>>>    */
>>>>>> for (i = 0; i < 1 << order; i++) {
>>>>>>       pgoff_t aligned_index = round_down(index, 1 << order);
>>>>>>       swp_entry_t tmp;
>>>>>>
>>>>>>       tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>>>>>>       __xa_store(&mapping->i_pages, aligned_index + i,
>>>>>>              swp_to_radix_entry(tmp), 0);
>>>>>> }
>>>>
>>>> Right. I will need to adjust swp_entry_t. Thanks for pointing this out.
>>>>
>>>>>
>>>>> In addition, after your patch, shmem_split_large_entry() seems to
>>>>> always return 0 even though it splits a large swap entry, but we still
>>>>> need to re-calculate the swap entry value after splitting; otherwise
>>>>> it may return errors due to shmem_confirm_swap() validation failure.
>>>>>
>>>>> /*
>>>>>   * If the large swap entry has already been split, it is
>>>>>   * necessary to recalculate the new swap entry based on
>>>>>   * the old order alignment.
>>>>>   */
>>>>>   if (split_order > 0) {
>>>>>     pgoff_t offset = index - round_down(index, 1 << split_order);
>>>>>
>>>>>     swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
>>>>> }
>>>>
>>>> Got it. I will fix it.
>>>>
>>>> BTW, do you mind sharing your swapin tests so that I can test my new
>>>> version properly?
>>>
>>> The diff below adjusts the swp_entry_t and returns the right order after
>>> shmem_split_large_entry(). Let me know if it fixes your issue.
>>
>> Fixed the compilation error. It would be great if you can share a swapin
>> test, so that I can test locally. Thanks.
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index b35ba250c53d..bfc4ef511391 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2162,7 +2162,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>   {
>>       struct address_space *mapping = inode->i_mapping;
>>       XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
>> -    int split_order = 0;
>> +    int split_order = 0, entry_order = 0;
>>       int i;
>>
>>       /* Convert user data gfp flags to xarray node gfp flags */
>> @@ -2180,6 +2180,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>           }
>>
>>           order = xas_get_order(&xas);
>> +        entry_order = order;
>>
>>           /* Try to split large swap entry in pagecache */
>>           if (order > 0) {
>> @@ -2192,23 +2193,23 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>                   xas_try_split(&xas, old, cur_order, GFP_NOWAIT);
>>                   if (xas_error(&xas))
>>                       goto unlock;
>> +
>> +                /*
>> +                 * Re-set the swap entry after splitting, and the swap
>> +                 * offset of the original large entry must be continuous.
>> +                 */
>> +                for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
>> +                    pgoff_t aligned_index = round_down(index, 1 << cur_order);
>> +                    swp_entry_t tmp;
>> +
>> +                    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> +                    __xa_store(&mapping->i_pages, aligned_index + i,
>> +                           swp_to_radix_entry(tmp), 0);
>> +                }
>>                   cur_order = split_order;
>>                   split_order =
>>                       xas_try_split_min_order(split_order);
>>               }
>
> This looks incorrect to me. Suppose we are splitting an order-9 swap
> entry: in the first iteration of the loop, it splits the order-9 swap
> entry into 8 order-6 swap entries. At this point, the order-6 swap
> entries are reset, and everything seems fine.
>
> However, in the second iteration, where an order-6 swap entry is split
> into 63 order-0 swap entries, the split operation itself is correct. But

typo: should be 64

> when resetting the order-0 swap entry, it seems incorrect. Now
> 'cur_order' = 6 and 'split_order' = 0, which means the range for the
> reset index is always between 0 and 63 (see __xa_store()).

Sorry for the confusion. The 'aligned_index' will be rounded down by
'cur_order' (which is 6), so the index is correct. But the swap offset
calculated by 'swp_offset(swap) + i' looks incorrect, because 'i' is
always between 0 and 63.

> > +for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
> > +    pgoff_t aligned_index = round_down(index, 1 << cur_order);
> > +    swp_entry_t tmp;
> > +
> > +    tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
> > +    __xa_store(&mapping->i_pages, aligned_index + i,
> > +        swp_to_radix_entry(tmp), 0);
> > +}
>
> However, if the index is greater than 63, it appears that it is not set
> to the correct new swap entry after splitting. Please correct me if I
> missed anything.
>
>> -
>> -            /*
>> -             * Re-set the swap entry after splitting, and the swap
>> -             * offset of the original large entry must be continuous.
>> -             */
>> -            for (i = 0; i < 1 << order; i++) {
>> -                pgoff_t aligned_index = round_down(index, 1 << order);
>> -                swp_entry_t tmp;
>> -
>> -                tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
>> -                __xa_store(&mapping->i_pages, aligned_index + i,
>> -                       swp_to_radix_entry(tmp), 0);
>> -            }
>>           }
>>
>>   unlock:
>> @@ -2221,7 +2222,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>>       if (xas_error(&xas))
>>           return xas_error(&xas);
>>
>> -    return split_order;
>> +    return entry_order;
>>   }
>>
>>   /*
>>
>>
>> Best Regards,
>> Yan, Zi
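To make the offset point concrete: below is a rough, untested sketch (not
the final fix) of how the reset loop could compute the swap offset relative
to the start of the original large entry, reusing 'entry_order' from the
diff above, so that slots beyond the first 64 also get the right offset:

    /* start of the original large swap entry in the mapping */
    pgoff_t swap_index = round_down(index, 1 << entry_order);

    for (i = 0; i < 1 << cur_order; i += (1 << split_order)) {
        pgoff_t aligned_index = round_down(index, 1 << cur_order);
        /* distance of this slot from the start of the original entry */
        pgoff_t offset = aligned_index + i - swap_index;
        swp_entry_t tmp;

        tmp = swp_entry(swp_type(swap), swp_offset(swap) + offset);
        __xa_store(&mapping->i_pages, aligned_index + i,
                   swp_to_radix_entry(tmp), 0);
    }

In the first iteration (cur_order == entry_order) aligned_index equals
swap_index, so this behaves like the posted diff; in later iterations the
chunk's distance from the start of the large entry is added, so the slots
above 63 would also be re-set with the correct swap offset.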