From mboxrd@z Thu Jan 1 00:00:00 1970
References: <20240219082040.7495-1-ryncsn@gmail.com> <20240219173147.3f4b50b7c9ae554008f50b66@linux-foundation.org>
From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 20 Feb 2024 12:56:16 +0800
Subject: Re: [PATCH v4] mm/swap: fix race when skipping swapcache
To: Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton, linux-mm@kvack.org, "Huang, Ying", Chris Li, Minchan Kim, Barry Song, Yu Zhao, SeongJae Park, David Hildenbrand, Hugh Dickins, Johannes Weiner, Matthew Wilcox, Michal Hocko, Yosry Ahmed, Aaron Lu, stable@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Feb 20, 2024 at 12:01 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 20, 2024 at 4:42 PM Kairui Song wrote:
> >
> > On Tue, Feb 20, 2024 at 9:31 AM Andrew Morton wrote:
> > >
> > > On Mon, 19 Feb 2024 16:20:40 +0800 Kairui Song wrote:
> > > >
> > > > From: Kairui Song
> > > >
> > > > When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
> > > > swapin the same entry at the same time, they get different pages (A, B).
> > > > Before one thread (T0) finishes the swapin and installs page (A)
> > > > to the PTE, another thread (T1) could finish swapin of page (B),
> > > > swap_free the entry, then swap out the possibly modified page
> > > > reusing the same entry. It breaks the pte_same check in (T0) because
> > > > the PTE value is unchanged, causing an ABA problem. Thread (T0) will
> > > > install a stale page (A) into the PTE and cause data corruption.
> > > >
> > > > @@ -3867,6 +3868,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > >         if (!folio) {
> > > >                 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > > >                     __swap_count(entry) == 1) {
> > > > +                       /*
> > > > +                        * Prevent parallel swapin from proceeding with
> > > > +                        * the cache flag. Otherwise, another thread may
> > > > +                        * finish swapin first, free the entry, and swapout
> > > > +                        * reusing the same entry. It's undetectable as
> > > > +                        * pte_same() returns true due to entry reuse.
> > > > +                        */
> > > > +                       if (swapcache_prepare(entry)) {
> > > > +                               /* Relax a bit to prevent rapid repeated page faults */
> > > > +                               schedule_timeout_uninterruptible(1);
> > >
> > > Well this is unpleasant. How often can we expect this to occur?
> >
> > The chance is very low. Using the current mainline kernel and ZRAM,
> > even with threads set to race on purpose using the reproducer I
> > provided, it occurred 1528 times out of 647132 page faults (~0.2%).
> >
> > If I run MySQL and sysbench with 128 threads and a 16G buffer pool,
> > with a 6G cgroup limit and 32G ZRAM, it occurred 1372 times over 40
> > minutes, out of 109930201 page faults in total (~0.001%).
>

Hi Barry,

> It might not be a problem for throughput, but for real-time and tail
> latency this hurts. For example, this might increase dropped UI frames,
> which is an important parameter when evaluating performance :-)
>

That's a true issue. As Chris mentioned before, I think we need some
clever data structure to solve this more naturally in the future; a
similar issue exists for cached swapin as well and has been there for a
while.

On the other hand, maybe applications that are extremely latency
sensitive should avoid swapping on fault in the first place? A swapin
could run into other issues like reclaim, throttling, or contention with
many other things; these seem more likely to occur than this race.

> BTW, I wonder if Ying's previous proposal - moving swapcache_prepare()
> after swap_read_folio() - will further help decrease the number?

We can move swapcache_prepare() to after the folio allocation or the
cgroup charge, but I didn't see an observable change in the statistics;
for some workloads the numbers are even worse. I think that's mostly
noise, or a higher swap-out rate, since all racing threads now allocate
an extra folio. Applications that have many pages swapped out due to a
memory limit are already on the edge of triggering another reclaim, so a
dozen more folio allocations could just trigger that...

And we can't move it after swap_read_folio()... That is exactly what we
want to protect.