From: Kairui Song <ryncsn@gmail.com>
Date: Thu, 16 Oct 2025 00:45:52 +0800
Subject: Re: [PATCH 1/4] mm, swap: do not perform synchronous discard during allocation
To: Chris Li, YoungJun Park
Cc: linux-mm@kvack.org, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
 Barry Song, Baolin Wang, David Hildenbrand, "Matthew Wilcox (Oracle)",
 Ying Huang, linux-kernel@vger.kernel.org, stable@vger.kernel.org
References: <20251007-swap-clean-after-swap-table-p1-v1-0-74860ef8ba74@tencent.com>
 <20251007-swap-clean-after-swap-table-p1-v1-1-74860ef8ba74@tencent.com>

On Wed, Oct 15, 2025 at 2:24 PM Kairui Song wrote:
>
> On Wed, Oct 15, 2025 at 12:00 PM Chris Li wrote:
> >
> > On Tue, Oct 14, 2025 at 2:27 PM Chris Li wrote:
> > >
> > > On Sun, Oct 12, 2025 at 9:49 AM Kairui Song wrote:
> > > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > > index cb2392ed8e0e..0d1924f6f495 100644
> > > > > > --- a/mm/swapfile.c
> > > > > > +++ b/mm/swapfile.c
> > > > > > @@ -1101,13 +1101,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > > > >                 goto done;
> > > > > >         }
> > > > > >
> > > > > > -       /*
> > > > > > -        * We don't have free cluster but have some clusters in discarding,
> > > > > > -        * do discard now and reclaim them.
> > > > > > -        */
> > > > > > -       if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > > > > -               goto new_cluster;
> > > > >
> > > > > Assume you follow my suggestion.
> > > > > Change this to some function to detect if there is a pending discard
> > > > > on this device. Return to the caller indicating that you need a
> > > > > discard for this device that has a pending discard.
> > > > > Add an output argument to indicate the discard device "discard" if needed.
> > > >
> > > > The problem I just realized is that if we just bail out here, we are
> > > > forbidding order 0 from stealing if there is any discarding cluster.
> > > > We just return here to let the caller handle the discard outside
> > > > the lock.
> > >
> > > Oh, yes, there might be a bit of a change in behavior. However, I can't
> > > see how it is such a bad thing if we wait for the pending discard to
> > > complete before stealing from and fragmenting the existing folio list.
> > > We will have fewer fragments compared to the original result. Again, my
> > > point is not that we always keep 100% of the old behavior; then there
> > > would be no room for improvement.
> > >
> > > My point is: are we doing the best we can in that situation,
> > > regardless of how unlikely it is?
> > >
> > > > It may just discard the cluster just fine, then retry from free
> > > > clusters. Then everything is fine; that's the easy part.
> > >
> > > Ack.
> > >
> > > > But it might also fail, and interestingly, in the failure case we need
> > >
> > > Can you spell out the failure case you have in mind? Do you mean the
> > > discard did happen, but another thread stole "the recently discarded
> > > then became free cluster"?
> > >
> > > Anyway, in such a case, the swap allocator should continue and find
> > > out we don't have anything to discard now; it will continue to the
> > > "steal from other order > 0 list" step.
> > >
> > > > to try again as well. It might fail due to a race with another
> > > > discard, in which case order 0 stealing is still feasible.
> > > > Or it might fail in get_swap_device_info (we have to release the
> > > > device to return here), in which case we should go back to the
> > > > plist and try other devices.
> > >
> > > When stealing from the other order > 0 list fails, we should try
> > > another device in the plist.
> > >
> > > > This is doable but seems kind of fragile; we'll have something like
> > > > this in the folio_alloc_swap function:
> > > >
> > > > local_lock(&percpu_swap_cluster.lock);
> > > > if (!swap_alloc_fast(&entry, order))
> > > >         swap_alloc_slow(&entry, order, &discard_si);
> > > > local_unlock(&percpu_swap_cluster.lock);
> > > >
> > > > +if (discard_si) {
> > >
> > > I feel the discard logic should be inside swap_alloc_slow().
> > > There is a plist_for_each_entry_safe() there; do the discard and
> > > retry inside that loop.
> > > If I previously suggested the change here, sorry, I have changed my
> > > mind after reasoning about the code a bit more.
> >
> > Actually, now that I have given it a bit more thought, one thing I
> > realized is that you might need to hold the percpu_swap_cluster lock
> > the whole time during allocation. That might force you to release the
> > lock and do the discard in the current position.
> >
> > If that is the case, then just making the small change in your patch
> > to allow waiting for the discard before trying the fragmentation list
> > might be good enough.
> >
> > Chris
>
> Thanks, I was composing a reply on this and just saw your new comment.
> I agree with this.

Hmm, it turns out that modifying V1 to handle non-order-0 allocation
failures also has some minor issues. Every mTHP swap allocation failure
would have slightly higher overhead due to the discard check. V1 is fine
since it only checks for a pending discard on order 0 failure, and order
0 allocation failure is uncommon and usually means OOM already.

I'm not saying V1 is the final solution, but I think maybe we can just
keep V1 as it is? That's easier for a stable backport too, and it is
already doing far better than before. The synchronous discard was added
in 2013, the percpu cluster added later that same year never handled it
carefully, and discard during allocation has been somewhat broken since
the recent swap allocator rework.

To optimize it further in a clean way, we have to reverse the
allocator's handling order of the plist and the fast / slow paths. The
current order is local_lock -> fast -> slow (plist). We can walk the
plist first, then do the fast / slow path: plist (or maybe something
faster than a plist that still handles the priority) -> local_lock ->
fast -> slow (bonus: this is more friendly to RT kernels too, I think).
That way we don't need to rewalk the plist after releasing the
local_lock, which saves a lot of trouble (see the rough sketch at the
end of this mail).

I remember discussing this with Youngjun some time ago on the mailing
list. I know things have changed a lot, but some of the ideas still seem
similar. I think his series moves the percpu cluster into each device
(si); we could also move the local_lock there, which is just what I'm
talking about here.
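For reference, a minimal sketch of the reordered flow I mean. This is
not a patch: the per-device local_lock (si->alloc_lock), the
alloc_swap_from_device() helper, and the fast / slow signatures taking
si are all made up for illustration; only get_swap_device_info(),
put_swap_device(), swap_do_scheduled_discard(), and the plist walk
correspond to existing code:

/*
 * Rough sketch only. swap_avail_head stands in for the real per-node
 * avail list, and si->alloc_lock is the hypothetical per-device
 * local_lock discussed above.
 */
static bool alloc_swap_from_device(struct swap_info_struct *si,
                                   swp_entry_t *entry, int order)
{
        bool ret;

        /*
         * Flush pending discards before taking the local lock, so a
         * failed attempt never has to drop the lock and rewalk the
         * plist.
         */
        if (si->flags & SWP_PAGE_DISCARD)
                swap_do_scheduled_discard(si);

        local_lock(&si->alloc_lock);
        ret = swap_alloc_fast(si, entry, order) ||
              swap_alloc_slow(si, entry, order);
        local_unlock(&si->alloc_lock);
        return ret;
}

int folio_alloc_swap_reordered(swp_entry_t *entry, int order)
{
        struct swap_info_struct *si;

        /*
         * The priority walk happens before, and outside of, the local
         * lock. Locking of the avail list itself is omitted in this
         * sketch.
         */
        plist_for_each_entry(si, &swap_avail_head, avail_list) {
                if (!get_swap_device_info(si))
                        continue;
                if (alloc_swap_from_device(si, entry, order)) {
                        put_swap_device(si);
                        return 0;
                }
                put_swap_device(si);
        }
        return -ENOMEM;
}

The point is only that the discard wait moves in front of the
local_lock; how the avail list locking and priority handling actually
work out is exactly the part that would need to be sorted out together
with Youngjun's series.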