Message-ID: <3d317ce9-2304-4361-adda-32abcc06e3dd@linux.alibaba.com>
Date: Mon, 30 Jun 2025 14:34:20 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Kairui Song, linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 4/7] mm/shmem, swap: clean up swap entry splitting
In-Reply-To: <20250627062020.534-5-ryncsn@gmail.com>
References: <20250627062020.534-1-ryncsn@gmail.com> <20250627062020.534-5-ryncsn@gmail.com>

On 2025/6/27 14:20, Kairui Song wrote:
> From: Kairui Song
>
> Instead of keeping different paths of splitting the entry and
> recalculating the swap entry and index, do it in one place.
>
> Whenever swapin brought in a folio smaller than the entry, split the
> entry. And always recalculate the entry and index, in case it might
> read in a folio that's larger than the entry order. This removes
> duplicated code and function calls, and makes the code more robust.
>
> Signed-off-by: Kairui Song
> ---
>  mm/shmem.c | 103 +++++++++++++++++++++--------------------------------
>  1 file changed, 41 insertions(+), 62 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f85a985167c5..5be9c905396e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2178,8 +2178,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
>  		swap_free_nr(swap, nr_pages);
>  }
>
> -static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
> -				   swp_entry_t swap, gfp_t gfp)
> +/*
> + * Split an existing large swap entry. @index should point to one sub mapping
> + * slot within the entry @swap, this sub slot will be split into order 0.
> + */
> +static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
> +				  swp_entry_t swap, gfp_t gfp)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
> @@ -2250,7 +2254,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
>  	if (xas_error(&xas))
>  		return xas_error(&xas);
>
> -	return entry_order;
> +	return 0;
>  }
>
>  /*
> @@ -2267,11 +2271,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	struct address_space *mapping = inode->i_mapping;
>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> +	int error, nr_pages, order, swap_order;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
>  	bool skip_swapcache = false;
>  	swp_entry_t swap;
> -	int error, nr_pages, order, split_order;
>
>  	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
>  	swap = radix_to_swp_entry(*foliop);
> @@ -2321,70 +2325,43 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  			goto failed;
>  		}
>
> -		/*
> -		 * Now swap device can only swap in order 0 folio, then we
> -		 * should split the large swap entry stored in the pagecache
> -		 * if necessary.
> -		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> -		if (split_order < 0) {
> -			error = split_order;
> -			goto failed;
> -		}
> -
> -		/*
> -		 * If the large swap entry has already been split, it is
> -		 * necessary to recalculate the new swap entry based on
> -		 * the old order alignment.
> -		 */
> -		if (split_order > 0) {
> -			pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> -		}
> -
>  		/* Here we actually start the io */
>  		folio = shmem_swapin_cluster(swap, gfp, info, index);
>  		if (!folio) {
>  			error = -ENOMEM;
>  			goto failed;
>  		}
> -	} else if (order > folio_order(folio)) {
> -		/*
> -		 * Swap readahead may swap in order 0 folios into swapcache
> -		 * asynchronously, while the shmem mapping can still stores
> -		 * large swap entries. In such cases, we should split the
> -		 * large swap entry to prevent possible data corruption.
> -		 */
> -		split_order = shmem_split_large_entry(inode, index, swap, gfp);
> -		if (split_order < 0) {
> -			folio_put(folio);
> -			folio = NULL;
> -			error = split_order;
> -			goto failed;
> -		}
> -
> -		/*
> -		 * If the large swap entry has already been split, it is
> -		 * necessary to recalculate the new swap entry based on
> -		 * the old order alignment.
> -		 */
> -		if (split_order > 0) {
> -			pgoff_t offset = index - round_down(index, 1 << split_order);
> -
> -			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> -		}
> -	} else if (order < folio_order(folio)) {
> -		swap.val = round_down(swap.val, 1 << folio_order(folio));
>  	}
>
>  alloced:
> +	/*
> +	 * We need to split an existing large entry if swapin brought in a
> +	 * smaller folio due to various of reasons.
> +	 *
> +	 * And worth noting there is a special case: if there is a smaller
> +	 * cached folio that covers @swap, but not @index (it only covers
> +	 * first few sub entries of the large entry, but @index points to
> +	 * later parts), the swap cache lookup will still see this folio,
> +	 * And we need to split the large entry here. Later checks will fail,
> +	 * as it can't satisfy the swap requirement, and we will retry
> +	 * the swapin from beginning.
> +	 */
> +	swap_order = folio_order(folio);

Nit: 'swap_order' is confusing; can you just use folio_order() or a better name?
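I.e. something like the below for the block that follows, just calling folio_order() at the use sites instead of introducing the extra local (only a rough sketch to illustrate the naming nit, untested):

	if (order > folio_order(folio)) {
		error = shmem_split_swap_entry(inode, index, swap, gfp);
		if (error)
			goto failed_nolock;
	}

	index = round_down(index, 1 << folio_order(folio));
	swap.val = round_down(swap.val, 1 << folio_order(folio));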
> +	if (order > swap_order) {
> +		error = shmem_split_swap_entry(inode, index, swap, gfp);
> +		if (error)
> +			goto failed_nolock;
> +	}
> +
> +	index = round_down(index, 1 << swap_order);
> +	swap.val = round_down(swap.val, 1 << swap_order);

The round_down() of index and swap value here may cause shmem_add_to_page_cache()
to fail to insert the new folio, because the swap value stored at that index in
the shmem mapping no longer matches, leading to another swapin page fault to
correct it.

For example, shmem stores a large swap entry of order 4 covering indices 0-15.
When a swapin fault occurs at index = 3 with swap.val = 0x4000, and a split
happens and this round_down() logic is applied, we end up with index = 3 and
swap.val = 0x4000, while the entry actually stored at index 3 in the shmem
mapping after the split is 0x4003. This mismatch would cause another swapin
fault.

I still prefer my original alignment method. Do you see any problem with it?

"
	if (split_order > 0) {
		pgoff_t offset = index - round_down(index, 1 << split_order);

		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
	}
"

> +
>  	/* We have to do this with folio locked to prevent races */
>  	folio_lock(folio);
>  	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
>  	    folio->swap.val != swap.val) {
>  		error = -EEXIST;
> -		goto unlock;
> +		goto failed_unlock;
>  	}
>  	if (!folio_test_uptodate(folio)) {
>  		error = -EIO;
> @@ -2405,8 +2382,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  		goto failed;
>  	}
>
> -	error = shmem_add_to_page_cache(folio, mapping,
> -					round_down(index, nr_pages),
> +	error = shmem_add_to_page_cache(folio, mapping, index,
>  					swp_to_radix_entry(swap), gfp);
>  	if (error)
>  		goto failed;
> @@ -2417,8 +2393,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	folio_mark_accessed(folio);
>
>  	if (skip_swapcache) {
> +		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
>  		folio->swap.val = 0;
> -		swapcache_clear(si, swap, nr_pages);
>  	} else {
>  		delete_from_swap_cache(folio);
>  	}
> @@ -2434,13 +2410,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	if (error == -EIO)
>  		shmem_set_folio_swapin_error(inode, index, folio, swap,
>  					     skip_swapcache);
> -unlock:
> -	if (skip_swapcache)
> -		swapcache_clear(si, swap, folio_nr_pages(folio));
> -	if (folio) {
> +failed_unlock:
> +	if (folio)
>  		folio_unlock(folio);
> -		folio_put(folio);
> +failed_nolock:
> +	if (skip_swapcache) {
> +		swapcache_clear(si, folio->swap, folio_nr_pages(folio));
> +		folio->swap.val = 0;
>  	}
> +	if (folio)
> +		folio_put(folio);
>  	put_swap_device(si);
>  	return error;
>  }
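To spell out the arithmetic in my example above (assuming an order-0 folio was
swapped in, so swap_order ends up as 0):

With the round_down() logic in this patch:

	index    = round_down(3, 1 << 0)      = 3
	swap.val = round_down(0x4000, 1 << 0) = 0x4000

but after shmem_split_swap_entry() the slot at index 3 in the mapping holds
0x4003, so shmem_add_to_page_cache() finds an unexpected entry, fails, and the
fault has to be retried.

With the original offset-based recalculation (split_order == 4 in this
example):

	offset   = 3 - round_down(3, 1 << 4) = 3
	swap.val = 0x4000 + 3                = 0x4003 (in terms of the swap offset)

which matches the split entry at index 3, so the insert can succeed on the
first fault, if I am reading the flow correctly.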