From: "Huang, Ying" <ying.huang@intel.com>
To: Kairui Song
Cc: linux-mm <linux-mm@kvack.org>, Chris Li
Subject: Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
In-Reply-To: (Kairui Song's message of "Mon, 4 Nov 2024 01:11:30 +0800")
References: <871q0641xd.fsf@yhuang6-desk2.ccr.corp.intel.com> <87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 04 Nov 2024 11:02:41 +0800
Message-ID: <87ed3r8zvy.fsf@yhuang6-desk2.ccr.corp.intel.com>
Kairui Song writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying wrote:
>>
>> Kairui Song writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> When reading the new swap cluster allocation code (sorry, I didn't
>> >> find time to review your patch in the first place), I found that in
>> >> cluster_alloc_swap_entry() the following code path is possible:
>> >>
>> >> cluster = this_cpu_ptr(si->percpu_cluster);
>> >> offset = alloc_swap_scan_cluster();
>> >> ...
>> >> spin_unlock(&ci->lock);
>> >> spin_unlock(&si->lock);
>> >> /* migrate to another cpu */
>> >> spin_lock(&ci->lock);
>> >> spin_lock(&si->lock);
>> >> cluster->next[order] = offset;
>> >>
>> >> That is, the per-CPU cluster of one CPU may be changed on another
>> >> CPU. I guess that this will not cause a functionality issue.
>> >> However, it makes the code harder to reason about. Is it possible
>> >> to avoid unlocking before changing the per-CPU cluster? Or to give
>> >> up updating the per-CPU cluster if we need to unlock?
>> >
>> > Hi Ying,
>> >
>> > Yes, I noticed this when working on the new si->lock series. It may
>> > cause fragmentation, but only if the CPU the current allocation
>> > migrated from requests another swap allocation; in that case it may
>> > waste one per-CPU cluster.
>> >
>> > No functionality or leak issue, but it does make things harder to
>> > reason about, as you mentioned.
>> >
>> > In the new series I used a local lock to prevent migration, so the
>> > CPU is stable during the whole process. However, that is not doable
>> > unless si->lock is removed from the allocation path; otherwise there
>> > are lock dependency issues.
>> >
>> > Do we need to fix this separately in the current tree? The
>> > fragmentation is overall better with these patches, so I don't think
>> > it is causing any issues. Before these patches, a per-CPU cluster
>> > could also be stolen by other CPUs, so they were never guaranteed to
>> > be accurate.
>> >
>> > One easy thing we can do is set cluster->next[order] to 0 before
>> > scanning for a new cluster to use, and check whether it is still 0
>> > before setting next[order] again.
>> >
>> > That may still cause fragmentation, as migration will cause extra
>> > clusters to be scanned and used, which is hard to avoid without
>> > changing too much code (like the things being done in the new
>> > si->lock series).
>>
>> I found another possible, more serious problem. After an unlock/lock
>> cycle, the state of the swap device may have changed (for example,
>> swapoff may have run). In the previous code,
>>
>> si->flags += SWP_SCANNING;
>>
>> if (!(si->flags & SWP_WRITEOK))
>>
>> were used for that. I don't find them used after the unlock/lock
>> cycle. Can you check it?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
>
> But a quick fix can be posted before the series. The following patch
> should be enough; what do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>  	VM_BUG_ON(!si->cluster_info);
>
> -	while (n_ret < nr) {
> +	si->flags += SWP_SCANNING;
> +	while (n_ret < nr && si->flags & SWP_WRITEOK) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>
>  		if (!offset)
>  			break;
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
> }
>
> The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
> like in the original allocation code. Multiple scanners can update the
> flag, and since SCANNING is the highest defined bit in the flags
> field, it is not strictly a bit flag in such cases. This will be
> cleaned up properly in the patch I just mentioned.

I think that the logic of SWP_SCANNING is correct.

As for SWP_WRITEOK, we should stop allocating swap entries as soon as
possible after another task clears it. So the above check may be too
late, because at that point we have already allocated the swap entry. I
think that we can check it in cluster_reclaim_range() and refuse to
allocate swap entries there.

Combined with the per-CPU cluster (cluster->next[]) changing issue we
discussed in the other email of this thread, I think the long-term
solution could be to separate swap entry allocation and reclaiming a
bit more. For example, we could first try to allocate some swap
entries with si->lock held, record the possible reclaim offset if any,
and change cluster->next[]; only then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying