From: "Huang, Ying"
To: Kairui Song
Cc: Minchan Kim, Chris Li, Barry Song <21cnbao@gmail.com>, linux-mm@kvack.org,
 Andrew Morton, Yu Zhao, Barry Song, SeongJae Park, Hugh Dickins,
 Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed,
 David Hildenbrand, stable@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache
In-Reply-To: (Kairui Song's message of "Fri, 9 Feb 2024 03:01:20 +0800")
References: <20240206182559.32264-1-ryncsn@gmail.com> <87eddnxy47.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 19 Feb 2024 13:42:23 +0800
Message-ID: <877cj1row0.fsf@yhuang6-desk2.ccr.corp.intel.com>

Kairui Song writes:

> On Thu, Feb 8, 2024 at 2:36 PM Huang, Ying wrote:
>>
>> Kairui Song writes:
>>
>> > On Thu, Feb 8, 2024 at 2:31 AM Minchan Kim wrote:
>> >>
>> >> On Wed, Feb 07, 2024 at 12:06:15PM +0800, Kairui Song wrote:
>> >> [snip]
>> >> >> >
>> >> > So I think the thing is, it's getting complex because this patch
>> >> > wanted to make it simple and just reuse the swap cache flags.
>> >>
>> >> I agree that a simple fix would be the important at this point.
>> >>
>> >> Considering your description, here's my understanding of the other idea:
>> >> Other method, such as increasing the swap count, haven't proven effective
>> >> in your tests. The approach risk forcing racers to rely on the swap cache
>> >> again and the potential performance loss in race scenario.
>> >>
>> >> While I understand that simplicity is important, and performance loss
>> >> in this case may be infrequent, I believe swap_count approach could be a
>> >> suitable solution. What do you think?
>> >
>> > Hi Minchan
>> >
>> > Yes, my main concern was about simplicity and performance.
>> >
>> > Increasing swap_count here will also race with another process from
>> > releasing swap_count to 0 (swapcache was able to sync callers in other
>> > call paths but we skipped swapcache here).
>>
>> What is the consequence of the race condition?
>
> Hi Ying,
>
> It will increase the swap count of an already freed entry, this race
> with multiple swap free/alloc logic that checks if count ==
> SWAP_HAS_CACHE or sets count to zero, or repeated free of an entry,
> all result in random corruption of the swap map. This happens a lot
> during stress testing.

You are right! Thanks for explanation.

--
Best Regards,
Huang, Ying

>>
>> > So the right step is: 1. Lock the cluster/swap lock; 2. Check if still
>> > have swap_count == 1, bail out if not; 3.
>> > Set it to 2;
>> > __swap_duplicate can be modified to support this, it's similar to
>> > existing logics for SWAP_HAS_CACHE.
>> >
>> > And swap freeing path will do more things, swapcache clean up needs to
>> > be handled even in the bypassing path since the racer may add it to
>> > swapcache.
>> >
>> > Reusing SWAP_HAS_CACHE seems to make it much simpler and avoided many
>> > overhead, so I used that way in this patch, the only issue is
>> > potentially repeated page faults now.
>> >
>> > I'm currently trying to add a SWAP_MAP_LOCK (or SWAP_MAP_SYNC, I'm bad
>> > at naming it) special value, so any racer can just spin on it to avoid
>> > all the problems, how do you think about this?
>>
>> Let's try some simpler method firstly.
>
> Another simpler idea is, add a schedule() or
> schedule_timeout_uninterruptible(1) in the swapcache_prepare failure
> path before goto out (just like __read_swap_cache_async). I think this
> should ensure in almost all cases, PTE is ready after it returns, also
> yields more CPU.
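
The three-step sequence quoted above (lock, recheck swap_count == 1, set it to 2) can be modeled in user space roughly as follows. This is only a sketch under stated assumptions: struct swap_slot, slot_try_pin() and slot_model_check() are hypothetical names, a pthread mutex stands in for the cluster/swap lock, and none of it is the kernel's __swap_duplicate() code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for one swap map slot. */
struct swap_slot {
	pthread_mutex_t lock;	/* models the cluster/swap lock */
	unsigned char count;	/* models the swap count of the entry */
};

/*
 * Step 1: take the lock. Step 2: recheck that the count is still 1 and
 * bail out if not. Step 3: raise it to 2. Returns true only if pinned.
 */
bool slot_try_pin(struct swap_slot *slot)
{
	bool pinned = false;

	pthread_mutex_lock(&slot->lock);
	if (slot->count == 1) {
		slot->count = 2;
		pinned = true;
	}
	pthread_mutex_unlock(&slot->lock);
	return pinned;
}

/* Small self-check: a live entry is pinned, a freed one is left alone. */
void slot_model_check(void)
{
	struct swap_slot live = { PTHREAD_MUTEX_INITIALIZER, 1 };
	struct swap_slot freed = { PTHREAD_MUTEX_INITIALIZER, 0 };

	assert(slot_try_pin(&live) && live.count == 2);
	assert(!slot_try_pin(&freed) && freed.count == 0);
}
```

Because the recheck happens under the same lock as the update, a slot whose count has already dropped to 0 is never incremented in this model, which is the corruption case described earlier in the thread. The extra cleanup the freeing path would need is not modeled here.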