From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Wed, 28 Feb 2024 15:46:17 +1300
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
To: Ryan Roberts
Cc: akpm@linux-foundation.org, david@redhat.com, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, mhocko@suse.com, shy828301@gmail.com,
 wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org,
 ying.huang@intel.com, yuzhao@google.com, chrisl@kernel.org,
 surenb@google.com, hanchuanhua@oppo.com
In-Reply-To: <6a55e785-73dd-4951-bad8-2670810a13b6@arm.com>
References: <20231025144546.577640-5-ryan.roberts@arm.com>
 <20240205095155.7151-1-v-songbaohua@oppo.com>
 <6a55e785-73dd-4951-bad8-2670810a13b6@arm.com>

On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> +       /*
> >> +        * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> +        * so indicate that we are scanning to synchronise with swapoff.
> >> +        */
> >> +       si->flags += SWP_SCANNING;
> >> +       ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> +       si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>                                  if (!can_split_folio(folio, NULL))
> >>                                          goto activate_locked;
> >>                                  /*
> >> -                                 * Split folios without a PMD map right
> >> -                                 * away. Chances are some or all of the
> >> -                                 * tail pages can be freed without IO.
> >> +                                 * Split PMD-mappable folios without a
> >> +                                 * PMD map right away. Chances are some
> >> +                                 * or all of the tail pages can be freed
> >> +                                 * without IO.
> >>                                   */
> >> -                                if (!folio_entire_mapcount(folio) &&
> >> +                                if (folio_test_pmd_mappable(folio) &&
> >> +                                    !folio_entire_mapcount(folio) &&
> >>                                      split_folio_to_list(folio,
> >>                                                          folio_list))
> >>                                          goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. This might cause applications to get
> > wrong (zero-filled) data while swapping in.
> >
> > In case one thread (T1) is reclaiming a large folio by some means, while
> > another thread (T2) is calling madvise MADV_PAGEOUT, and at the same time
> > we have two threads T3 and T4 swapping in in parallel. T1 doesn't split
> > and T2 does split, as below.
>

Hi Ryan,

> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is this a
> separate problem to the race you solved with TTU_SYNC or is this solving the
> same problem?

They are the same. After sending you the report about the races, I spent
some time and finally figured out what was happening and why corrupted
data appeared while swapping in. It is absolutely not your fault, but
TTU_SYNC does somehow resolve my problem, though it is not the root cause.

The corrupted data only reproduces after applying patch 4[1] of the
swap-in series,

[1] [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@gmail.com/

Suppose we have a large folio with 16 PTEs as below. After add_to_swap(),
it gets swap offset 0x10000, and its PTEs are all still present because
the folio is still mapped:

PTE     pte_state
PTE0    present
PTE1    present
PTE2    present
PTE3    present
...
PTE15   present
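To make the layout concrete, here is a purely illustrative sketch (not
code from either series) of the swap entry backing subpage i of this
folio, using the generic helpers from include/linux/swapops.h and assuming
folio->swap holds the entry of subpage 0 (offset 0x10000 in this example):

/* illustrative only: the entry try_to_unmap_one() would later install at PTEi */
swp_entry_t entry = swp_entry(swp_type(folio->swap),
                              swp_offset(folio->swap) + i);
pte_t swp_pte = swp_entry_to_pte(entry);

This per-subpage contiguity is exactly what the pte_range_swap() check
further below relies on.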
Then we get to try_to_unmap_one(). As try_to_unmap_one() doesn't hold the
PTL from PTE0, while it scans the PTEs we might have

PTE     pte_state
PTE0    none (someone is writing PTE0 for various reasons)
PTE1    present
PTE2    present
PTE3    present
...
PTE15   present

We hold the PTL from PTE1. After try_to_unmap_one(), the PTEs become

PTE     pte_state
PTE0    present (someone finished the write of PTE0)
PTE1    swap 0x10001
PTE2    swap 0x10002
PTE3    swap 0x10003
...
PTE15   swap 0x1000F

Thus, after try_to_unmap_one(), the large folio is still mapped, so it
stays in the swapcache. Now a parallel thread runs MADV_PAGEOUT, finds
this large folio is not completely mapped, and splits it into 16 small
folios, but their swap offsets are kept. Now the swapcache holds 16
folios with contiguous swap offsets.

MADV_PAGEOUT then reclaims these 16 folios; after 16 new calls to
try_to_unmap_one(),

PTE     pte_state
PTE0    swap 0x10000  SWAP_HAS_CACHE
PTE1    swap 0x10001  SWAP_HAS_CACHE
PTE2    swap 0x10002  SWAP_HAS_CACHE
PTE3    swap 0x10003  SWAP_HAS_CACHE
...
PTE15   swap 0x1000F  SWAP_HAS_CACHE

From this point on, these 16 PTEs can end up in various different states,
for example

PTE     pte_state
PTE0    swap 0x10000  SWAP_HAS_CACHE = 0 -> cleared by finished pageout and remove_mapping
PTE1    swap 0x10001  SWAP_HAS_CACHE = 0 -> cleared by finished pageout and remove_mapping
PTE2    swap 0x10002  SWAP_HAS_CACHE = 0 -> cleared by concurrent swapin and swapout
PTE3    swap 0x10003  SWAP_HAS_CACHE = 1
...
PTE13   swap 0x1000D  SWAP_HAS_CACHE = 1
PTE14   swap 0x1000E  SWAP_HAS_CACHE = 1
PTE15   swap 0x1000F  SWAP_HAS_CACHE = 1

All of them have swap_count = 1 but differing SWAP_HAS_CACHE: some of
these small folios are still in the swapcache, others are not.

Then do_swap_page() runs on one PTE whose SWAP_HAS_CACHE = 0 and
swap_count = 1 (its folio is not in the swapcache, thus has already been
written to swap), and we do this check:

static bool pte_range_swap(pte_t *pte, int nr_pages)
{
        int i;
        swp_entry_t entry;
        unsigned type;
        pgoff_t start_offset;

        entry = pte_to_swp_entry(ptep_get_lockless(pte));
        if (non_swap_entry(entry))
                return false;
        start_offset = swp_offset(entry);
        if (start_offset % nr_pages)
                return false;

        type = swp_type(entry);
        for (i = 1; i < nr_pages; i++) {
                entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
                if (non_swap_entry(entry))
                        return false;
                if (swp_offset(entry) != start_offset + i)
                        return false;
                if (swp_type(entry) != type)
                        return false;
        }

        return true;
}

Since those swap entries are contiguous, we call swap_read_folio() for
the whole range. For the entries whose folios are still in the swapcache
and haven't been written back yet, we get zero-filled data from zRAM.

So the root cause is that pte_range_swap() should also check that all 16
swap_map entries have SWAP_HAS_CACHE as false, e.g.

static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
        ...
        count = si->swap_map[start_offset];
        for (i = 1; i < nr_pages; i++) {
                entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
                if (non_swap_entry(entry))
                        return false;
                if (swp_offset(entry) != start_offset + i)
                        return false;
                if (swp_type(entry) != type)
                        return false;
                /* fall back to small folios if SWAP_HAS_CACHE isn't the same */
                if (si->swap_map[start_offset + i] != count)
                        return false;
        }
        return true;
}

But somehow TTU_SYNC "resolves" it by giving MADV_PAGEOUT no chance to
split this folio, because the large folio either has all of its PTEs
converted to swap entries or keeps all of them present.
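For reference, here is a minimal sketch of the reclaim-side change I have
in mind, assuming we simply OR TTU_SYNC into the existing ttu_flags setup
in shrink_folio_list() for any large folio (my illustration only, not
Ryan's actual patch):

if (folio_mapped(folio)) {
        enum ttu_flags flags = TTU_BATCH_FLUSH;

        if (folio_test_pmd_mappable(folio))
                flags |= TTU_SPLIT_HUGE_PMD;

        /*
         * Sketch: TTU_SYNC makes the rmap walk take the PTL before
         * inspecting the PTEs, so a PTE that is momentarily clear due to
         * a concurrent break-before-make is not skipped; the large folio
         * then ends up either fully unmapped or left fully mapped.
         */
        if (folio_test_large(folio))
                flags |= TTU_SYNC;

        try_to_unmap(folio, flags);
}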
Though the bug is within the swap-in series, I am still a big fan of
TTU_SYNC for large folio reclamation, for at least three reasons:

1. We remove some possibility that large folios fail to be reclaimed,
improving reclamation efficiency.

2. We avoid many strange corner cases and potential folio splits during
reclamation. Without TTU_SYNC, folios can be split later, or end up
partially set to swap entries while partially still present.

3. We don't increase PTL contention. My tests show try_to_unmap_one will
always get the PTL after it sometimes skips one or two PTEs, because the
intermediate break-before-make windows are short. Of course, most of the
time try_to_unmap_one will get the PTL from PTE0.

>
> Thanks,
> Ryan
>

Thanks
Barry