From: "Huang, Ying" <ying.huang@intel.com>
To: Kairui Song
Cc: linux-mm <linux-mm@kvack.org>, Chris Li
Subject: Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
In-Reply-To: (Kairui Song's message of "Mon, 4 Nov 2024 01:11:30 +0800")
References: <871q0641xd.fsf@yhuang6-desk2.ccr.corp.intel.com> <87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 04 Nov 2024 11:02:41 +0800
Message-ID: <87ed3r8zvy.fsf@yhuang6-desk2.ccr.corp.intel.com>
Kairui Song writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying wrote:
>>
>> Kairui Song writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> When reading the new swap cluster allocation code (sorry, I didn't
>> >> find time to review your patch in the first place), I found that in
>> >> cluster_alloc_swap_entry() the following code path is possible:
>> >>
>> >> cluster = this_cpu_ptr(si->percpu_cluster);
>> >> offset = alloc_swap_scan_cluster();
>> >> ...
>> >> spin_unlock(&ci->lock);
>> >> spin_unlock(&si->lock);
>> >> /* migrate to another cpu */
>> >> spin_lock(&ci->lock);
>> >> spin_lock(&si->lock);
>> >> cluster->next[order] = offset;
>> >>
>> >> That is, the per-CPU cluster of one CPU may be changed on another
>> >> CPU. I guess that this will not cause a functionality issue.
>> >> However, it makes the code harder to reason about. Is it possible
>> >> to avoid unlocking before changing the per-CPU cluster? Or to give
>> >> up updating the per-CPU cluster if we need to unlock?
>> >
>> > Hi Ying,
>> >
>> > Yes, I noticed this when working on the new si->lock series. It may
>> > cause fragmentation, but only if the CPU the current allocation
>> > migrated from requests another swap allocation; in that case it may
>> > waste one per-CPU cluster.
>> >
>> > No functionality or leak issue, but it does make things harder to
>> > reason about, as you mentioned.
>> >
>> > In the new series I used a local lock to prevent migration, so the
>> > CPU is stable during the whole process. However, that is not doable
>> > unless si->lock is removed from the allocation path; otherwise there
>> > are lock dependency issues.
>> >
>> > Do we need to fix this separately in the current tree? The
>> > fragmentation is overall better with these patches, so I don't think
>> > it is causing any issues. Before these patches, a per-CPU cluster
>> > could also be stolen by other CPUs, so they were never guaranteed to
>> > be accurate.
>> >
>> > One easy thing we can do is set cluster->next[order] to 0 before
>> > scanning for a new cluster to use, and check whether it is still 0
>> > before setting next[order] again.
>> >
>> > That may still cause fragmentation, as migration will cause extra
>> > clusters to be scanned and used, which is hard to avoid without
>> > changing too much code (like the things being done in the new
>> > si->lock series).
>>
>> I found another possible, more serious problem. After an unlock/lock
>> cycle, the state of the swap device may have changed (for example,
>> swapoff may have run). In the previous code,
>>
>> si->flags += SWP_SCANNING;
>>
>> if (!(si->flags & SWP_WRITEOK))
>>
>> were used for that. I don't find them used after the unlock/lock
>> cycle. Can you check it?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
>
> But a quick fix can be posted before the series. The following patch
> should be enough; what do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>  	VM_BUG_ON(!si->cluster_info);
>
> -	while (n_ret < nr) {
> +	si->flags += SWP_SCANNING;
> +	while (n_ret < nr && si->flags & SWP_WRITEOK) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>
>  		if (!offset)
>  			break;
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
> }
>
> The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
> like in the original allocation code. Multiple scanners can update the
> flag, and since SCANNING is the highest defined bit in the flags
> field, it is not strictly a bit flag in such cases. This will be
> cleaned up properly in the patch I just mentioned.

I think that the logic of SWP_SCANNING is correct.

As for SWP_WRITEOK, we should stop allocating swap entries as soon as
possible after another task clears it. So the above check may be too
late, because at that point we have already allocated the swap entry. I
think that we can check it in cluster_reclaim_range() and refuse to
allocate swap entries there.

Combined with the per-CPU cluster (cluster->next[]) changing issue we
discussed in the other email of this thread, I think the long-term
solution could be to separate swap entry allocation and reclaiming a
bit more. For example, we could first try to allocate some swap
entries with si->lock held, record the possible reclaim offset if any,
and change cluster->next[]; only then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying