From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
    Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
    Kairui Song
Subject: [PATCH 3/4] mm/shmem, swap: improve mthp swapin process
Date: Wed, 18 Jun 2025 02:35:02 +0800
Message-ID: <20250617183503.10527-4-ryncsn@gmail.com>
X-Mailer: git-send-email 2.50.0
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Kairui Song

Tidy up the mTHP swapin workflow. There should be no feature change;
this only consolidates the mTHP-related checks into one place, so they
are now all wrapped by CONFIG_TRANSPARENT_HUGEPAGE and will be trimmed
off by the compiler if not needed.

Signed-off-by: Kairui Song
---
 mm/shmem.c | 175 ++++++++++++++++++++++++-----------------------------
 1 file changed, 78 insertions(+), 97 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0ad49e57f736..46dea2fa1b43 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1975,31 +1975,51 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
         return ERR_PTR(error);
 }
 
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_direct(struct inode *inode,
                 struct vm_area_struct *vma, pgoff_t index,
-                swp_entry_t entry, int order, gfp_t gfp)
+                swp_entry_t entry, int *order, gfp_t gfp)
 {
         struct shmem_inode_info *info = SHMEM_I(inode);
+        int nr_pages = 1 << *order;
         struct folio *new;
+        pgoff_t offset;
         void *shadow;
-        int nr_pages;
 
         /*
          * We have arrived here because our zones are constrained, so don't
          * limit chance of success with further cpuset and node constraints.
          */
         gfp &= ~GFP_CONSTRAINT_MASK;
-        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && order > 0) {
-                gfp_t huge_gfp = vma_thp_gfp_mask(vma);
+        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+                if (WARN_ON_ONCE(*order))
+                        return ERR_PTR(-EINVAL);
+        } else if (*order) {
+                /*
+                 * If uffd is active for the vma, we need per-page fault
+                 * fidelity to maintain the uffd semantics, then fallback
+                 * to swapin order-0 folio, as well as for zswap case.
+                 * Any existing sub folio in the swap cache also blocks
+                 * mTHP swapin.
+                 */
+                if ((vma && userfaultfd_armed(vma)) ||
+                    !zswap_never_enabled() ||
+                    non_swapcache_batch(entry, nr_pages) != nr_pages) {
+                        offset = index - round_down(index, nr_pages);
+                        entry = swp_entry(swp_type(entry),
+                                          swp_offset(entry) + offset);
+                        *order = 0;
+                        nr_pages = 1;
+                } else {
+                        gfp_t huge_gfp = vma_thp_gfp_mask(vma);
 
-                gfp = limit_gfp_mask(huge_gfp, gfp);
+                        gfp = limit_gfp_mask(huge_gfp, gfp);
+                }
         }
 
-        new = shmem_alloc_folio(gfp, order, info, index);
+        new = shmem_alloc_folio(gfp, *order, info, index);
         if (!new)
                 return ERR_PTR(-ENOMEM);
 
-        nr_pages = folio_nr_pages(new);
         if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
                                            gfp, entry)) {
                 folio_put(new);
@@ -2165,8 +2185,12 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
         swap_free_nr(swap, nr_pages);
 }
 
-static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
-                                   swp_entry_t swap, gfp_t gfp)
+/*
+ * Split an existing large swap entry. @index should point to one sub mapping
+ * slot within the entry @swap; this sub slot will be split into order 0.
+ */
+static int shmem_split_swap_entry(struct inode *inode, pgoff_t index,
+                                  swp_entry_t swap, gfp_t gfp)
 {
         struct address_space *mapping = inode->i_mapping;
         XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
@@ -2226,7 +2250,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
                         cur_order = split_order;
                         split_order = xas_try_split_min_order(split_order);
                 }
-
 unlock:
         xas_unlock_irq(&xas);
 
@@ -2237,7 +2260,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
         if (xas_error(&xas))
                 return xas_error(&xas);
 
-        return entry_order;
+        return 0;
 }
 
 /*
@@ -2254,11 +2277,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         struct address_space *mapping = inode->i_mapping;
         struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
         struct shmem_inode_info *info = SHMEM_I(inode);
+        int error, nr_pages, order, swap_order;
         struct swap_info_struct *si;
         struct folio *folio = NULL;
         bool skip_swapcache = false;
         swp_entry_t swap;
-        int error, nr_pages, order, split_order;
 
         VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
         swap = radix_to_swp_entry(*foliop);
@@ -2283,110 +2306,66 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         /* Look it up and read it in.. */
         folio = swap_cache_get_folio(swap, NULL, 0);
         if (!folio) {
-                int nr_pages = 1 << order;
-                bool fallback_order0 = false;
-
                 /* Or update major stats only when swapin succeeds?? */
                 if (fault_type) {
                         *fault_type |= VM_FAULT_MAJOR;
                         count_vm_event(PGMAJFAULT);
                         count_memcg_event_mm(fault_mm, PGMAJFAULT);
                 }
-
-                /*
-                 * If uffd is active for the vma, we need per-page fault
-                 * fidelity to maintain the uffd semantics, then fallback
-                 * to swapin order-0 folio, as well as for zswap case.
-                 * Any existing sub folio in the swap cache also blocks
-                 * mTHP swapin.
-                 */
-                if (order > 0 && ((vma && unlikely(userfaultfd_armed(vma))) ||
-                                  !zswap_never_enabled() ||
-                                  non_swapcache_batch(swap, nr_pages) != nr_pages))
-                        fallback_order0 = true;
-
-                /* Skip swapcache for synchronous device. */
-                if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
-                        folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+                /* Try direct mTHP swapin bypassing swap cache and readahead */
+                if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+                        swap_order = order;
+                        folio = shmem_swapin_direct(inode, vma, index,
+                                                    swap, &swap_order, gfp);
                         if (!IS_ERR(folio)) {
                                 skip_swapcache = true;
                                 goto alloced;
                         }
-
-                        /*
-                         * Fallback to swapin order-0 folio unless the swap entry
-                         * already exists.
-                         */
+                        /* Fallback if order > 0 swapin failed with -ENOMEM */
                         error = PTR_ERR(folio);
                         folio = NULL;
-                        if (error == -EEXIST)
+                        if (error != -ENOMEM || !swap_order)
                                 goto failed;
                 }
-
                 /*
-                 * Now swap device can only swap in order 0 folio, then we
-                 * should split the large swap entry stored in the pagecache
-                 * if necessary.
+                 * Try order 0 swapin using swap cache and readahead; it may
+                 * still return an order > 0 folio due to a raced swap cache.
                  */
-                split_order = shmem_split_large_entry(inode, index, swap, gfp);
-                if (split_order < 0) {
-                        error = split_order;
-                        goto failed;
-                }
-
-                /*
-                 * If the large swap entry has already been split, it is
-                 * necessary to recalculate the new swap entry based on
-                 * the old order alignment.
-                 */
-                if (split_order > 0) {
-                        pgoff_t offset = index - round_down(index, 1 << split_order);
-
-                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-                }
-
-                /* Here we actually start the io */
                 folio = shmem_swapin_cluster(swap, gfp, info, index);
                 if (!folio) {
                         error = -ENOMEM;
                         goto failed;
                 }
-        } else if (order > folio_order(folio)) {
-                /*
-                 * Swap readahead may swap in order 0 folios into swapcache
-                 * asynchronously, while the shmem mapping can still stores
-                 * large swap entries. In such cases, we should split the
-                 * large swap entry to prevent possible data corruption.
-                 */
-                split_order = shmem_split_large_entry(inode, index, swap, gfp);
-                if (split_order < 0) {
-                        folio_put(folio);
-                        folio = NULL;
-                        error = split_order;
-                        goto failed;
-                }
-
-                /*
-                 * If the large swap entry has already been split, it is
-                 * necessary to recalculate the new swap entry based on
-                 * the old order alignment.
-                 */
-                if (split_order > 0) {
-                        pgoff_t offset = index - round_down(index, 1 << split_order);
-
-                        swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-                }
-        } else if (order < folio_order(folio)) {
-                swap.val = round_down(swp_type(swap), folio_order(folio));
         }
-
 alloced:
+        /*
+         * We need to split an existing large entry if swapin brought in a
+         * smaller folio for various reasons.
+         *
+         * Also worth noting is a special case: if there is a smaller cached
+         * folio that covers @swap but not @index (it only covers the first
+         * few sub entries of the large entry, while @index points to later
+         * parts), the swap cache lookup will still see this folio, and we
+         * need to split the large entry here. Later checks will then fail,
+         * as it can't satisfy the swap requirement, and we will retry the
+         * swapin from the beginning.
+         */
+        swap_order = folio_order(folio);
+        if (order > swap_order) {
+                error = shmem_split_swap_entry(inode, index, swap, gfp);
+                if (error)
+                        goto failed_nolock;
+        }
+
+        index = round_down(index, 1 << swap_order);
+        swap.val = round_down(swap.val, 1 << swap_order);
+
         /* We have to do this with folio locked to prevent races */
         folio_lock(folio);
         if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
             folio->swap.val != swap.val) {
                 error = -EEXIST;
-                goto unlock;
+                goto failed_unlock;
         }
         if (!folio_test_uptodate(folio)) {
                 error = -EIO;
@@ -2407,8 +2386,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
                 goto failed;
         }
 
-        error = shmem_add_to_page_cache(folio, mapping,
-                                        round_down(index, nr_pages),
+        error = shmem_add_to_page_cache(folio, mapping, index,
                                         swp_to_radix_entry(swap), gfp);
         if (error)
                 goto failed;
@@ -2419,8 +2397,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         folio_mark_accessed(folio);
 
         if (skip_swapcache) {
+                swapcache_clear(si, folio->swap, folio_nr_pages(folio));
                 folio->swap.val = 0;
-                swapcache_clear(si, swap, nr_pages);
         } else {
                 delete_from_swap_cache(folio);
         }
@@ -2436,13 +2414,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
         if (error == -EIO)
                 shmem_set_folio_swapin_error(inode, index, folio, swap,
                                              skip_swapcache);
-unlock:
-        if (skip_swapcache)
-                swapcache_clear(si, swap, folio_nr_pages(folio));
-        if (folio) {
+failed_unlock:
+        if (folio)
                 folio_unlock(folio);
-                folio_put(folio);
+failed_nolock:
+        if (skip_swapcache) {
+                swapcache_clear(si, folio->swap, folio_nr_pages(folio));
+                folio->swap.val = 0;
         }
+        if (folio)
+                folio_put(folio);
         put_swap_device(si);
         return error;
 }
-- 
2.50.0
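
A note on the order-0 fallback arithmetic used above (in shmem_swapin_direct()
and after the alloced: label): the faulting index is rounded down to the start
of the large entry, and the distance from that start selects the matching
order-0 sub-slot of the swap entry. Below is a minimal userspace sketch of
that arithmetic, with plain integers standing in for pgoff_t and swp_entry_t;
round_down_ul() is a stand-in for the kernel's round_down() and the values are
made up for illustration only.

#include <stdio.h>

/* Round v down to a multiple of align (align must be a power of two). */
static unsigned long round_down_ul(unsigned long v, unsigned long align)
{
        return v & ~(align - 1);
}

int main(void)
{
        unsigned long index = 21;        /* faulting page index in the mapping */
        unsigned long swap_offset = 512; /* first slot of the large swap entry */
        unsigned long nr_pages = 16;     /* 1 << order of the large entry */

        /* Distance of the faulting index from the start of the large entry. */
        unsigned long offset = index - round_down_ul(index, nr_pages);

        /* Falling back to order 0: pick the matching sub-slot of the entry. */
        printf("order-0 swap slot: %lu (offset %lu within the entry)\n",
               swap_offset + offset, offset);
        return 0;
}

Compiled standalone, this prints slot 517 at offset 5, mirroring how
offset = index - round_down(index, nr_pages) is used to advance
swp_offset(entry) when an mTHP swapin falls back to a single page.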