From: Barry Song <21cnbao@gmail.com>
Date: Tue, 5 Mar 2024 10:41:55 +1300
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
To: David Hildenbrand
Cc: Ryan Roberts, akpm@linux-foundation.org, linux-mm@kvack.org, chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins
In-Reply-To: <882fcbdd-5392-4dbf-99e4-b35defd9e3dc@redhat.com>

On Tue, Mar 5, 2024 at 10:02 AM David Hildenbrand wrote:
>
> On 04.03.24 21:42, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand wrote:
> >>
> >> On 04.03.24 14:03, Ryan Roberts wrote:
> >>> On 04/03/2024 12:41, David Hildenbrand wrote:
> >>>> On 04.03.24 13:20, Ryan Roberts wrote:
> >>>>> Hi Barry,
> >>>>>
> >>>>> On 04/03/2024 10:37, Barry Song wrote:
> >>>>>> From: Barry Song
> >>>>>>
> >>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>>>> PTE modifications such as break-before-make: while iterating the PTEs
> >>>>>> of a large folio, it only begins to acquire the PTL once it finds
> >>>>>> a valid (present) PTE. break-before-make transiently sets PTEs to
> >>>>>> pte_none, so a large folio's PTEs might be partially skipped in
> >>>>>> try_to_unmap_one().
> >>>>>
> >>>>> I just want to check my understanding here - I think the problem occurs for
> >>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>>>> that I've had a look at the code and have a better understanding, I think that
> >>>>> must be the case? And therefore this problem exists independently of my work to
> >>>>> support swap-out of mTHP? (From your previous report I was under the impression
> >>>>> that it only affected mTHP.)
> >>>>>
> >>>>> It's just that the problem is becoming more pronounced because with mTHP,
> >>>>> PTE-mapped large folios are much more common?
> >>>>
> >>>> That is my understanding.
> >>>>
> >>>>>> For example, for an anon folio, after try_to_unmap_one() we may have
> >>>>>> PTE0 present while PTE1 ~ PTE(nr_pages - 1) are swap entries. So the
> >>>>>> folio is still mapped and fails to be reclaimed. What's even more
> >>>>>> worrying is that its PTEs are no longer in a unified state. This might
> >>>>>> lead to an accidental folio_split() afterwards, and since some of the
> >>>>>> PTEs are now swap entries, accessing them will incur page faults
> >>>>>> through do_swap_page().
> >>>>>>
> >>>>>> It creates both anxiety and extra expense. While we can't stop
> >>>>>> userspace's unmap from breaking up unified PTEs such as CONT-PTE for
> >>>>>> a large folio, we can at least keep the kernel from breaking them up
> >>>>>> purely as a side effect of its own code design.
> >>>>>>
> >>>>>> This patch holds the PTL from PTE0, so the folio will either be
> >>>>>> entirely reclaimed or entirely kept. On the other hand, this approach
> >>>>>> doesn't increase PTL contention. Even w/o the patch,
> >>>>>> page_vma_mapped_walk() will always take the PTL after sometimes
> >>>>>> skipping one or two PTEs, because the intermediate break-before-make
> >>>>>> windows are short, according to testing. Of course, even w/o this
> >>>>>> patch, the vast majority of try_to_unmap_one() calls already take the
> >>>>>> PTL from PTE0. This patch makes that number 100%.
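
To make the above concrete, one way to get "PTL from PTE0" - shown only as a
rough illustration, not necessarily the exact diff - is to request the
synchronous lookup (PVMW_SYNC) for large folios, so that page_vma_mapped_walk()
takes the PTE lock before it examines the first PTE instead of skipping
transiently non-present entries without the lock:

        /*
         * Rough illustration only, inside try_to_unmap_one(): with
         * PVMW_SYNC, the walk takes the PTE lock up front via
         * pte_offset_map_lock(), so the PTL is held from the folio's
         * first PTE and a concurrent break-before-make cannot cause
         * earlier PTEs to be skipped while unlocked.
         */
        if (folio_test_large(folio))
                pvmw.flags |= PVMW_SYNC;

        while (page_vma_mapped_walk(&pvmw)) {
                /* ... unmap each PTE of the folio under the PTL ... */
        }
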
> >>>>>> The other option is that we can give up in try_to_unmap_one()
> >>>>>> once we find that PTE0 is not the first entry we take the PTL on;
> >>>>>> in that case we call page_vma_mapped_walk_done() to end the
> >>>>>> iteration. This keeps the PTEs unified while the folio isn't
> >>>>>> reclaimed. The result is quite similar to small folios with one
> >>>>>> PTE - either entirely reclaimed or entirely kept.
> >>>>>>
> >>>>>> Reclaiming large folios by holding the PTL from PTE0 seems a better
> >>>>>> option compared to giving up after detecting that the PTL begins
> >>>>>> from a non-PTE0 entry.
> >>>>>>
> >>>>
> >>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >>>> we can drop some of the details.
> >>>>
> >>>> If you need some inspiration, I can give it a shot.
> >>>>
> >>>>>> Cc: Hugh Dickins
> >>>>>> Signed-off-by: Barry Song
> >>>>>
> >>>>> Do we need a Fixes tag?
> >>>>
> >>>> What would be the description of the problem we are fixing?
> >>>>
> >>>> 1) failing to unmap?
> >>>>
> >>>> That can happen with small folios as well IIUC.
> >>>>
> >>>> 2) Putting the large folio on the deferred split queue?
> >>>>
> >>>> That sounds more reasonable.
> >>>
> >>> Isn't the real problem today that we can end up writing a THP to the swap file
> >>> (so 2M more IO and space used) but we can't remove it from memory, so no actual
> >>> reclaim happens? Although I guess your (2) is really just another way of saying
> >>> that.
> >>
> >> The same could happen with small folios I believe? We might end up
> >> running into the
> >>
> >> folio_mapped()
> >>
> >> after the try_to_unmap().
> >>
> >> Note that the actual I/O does not happen during add_to_swap(), but
> >> during the pageout() call when we find the folio to be dirty.
> >>
> >> So there would not actually be more I/O. Only swap space would be
> >> reserved, to be used later when not running into the race.
> >
> > I am not worried about small folios at all, as they have only one PTE,
> > so the PTE is either completely unmapped or completely mapped.
> >
> > For large folios it is a different story. For example, with a large
> > folio of 16 PTEs mapped with CONT-PTE, we can end up with:
> >
> > 1. an unfolded CONT-PTE, e.g. PTE0 present, PTE1-PTE15 swap entries;
> >
> > 2. page faults on PTE1-PTE15 after try_to_unmap() if we access them.
> > These are totally useless page faults that could be avoided if we
> > unmapped properly in the first place;
> >
> > 3. a potential need to split the large folio afterwards. For example,
> > MADV_PAGEOUT and MADV_FREE might split it after finding it is not
> > completely mapped.
> >
> > For small folios, we don't have any of the above concerns.
>
> Right, but when we talk about "Fixes:", what exactly do we consider
> "really broken" above and what is "undesired"?
>
> (a) is there a correctness issue? I don't think so.
>
> (b) is there a real performance issue? I'd like to understand.
>
> After all, we've been living with that ever since we supported THP_SWAP,
> correct? "Something does not work ideally in some corner cases" might be
> reasonable to handle here (and I really think we should), but might not
> be worth a "Fixes:".
>
> So if we could clarify that, it would be great.

I don't think this needs a Fixes tag, and I would consider it an optimization
for corner cases. I also don't expect a noticeable performance improvement,
since this doesn't happen very often, though I believe it does improve those
corner cases. (Either way - whether we hold the PTL from PTE0 or bail out as
sketched below - the behaviour becomes all-or-nothing for the large folio.)
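
For reference, the bail-out alternative from the commit message could look
roughly like the following inside the page_vma_mapped_walk() loop. This is
just a sketch: the "first" flag and the pfn comparison are illustrative,
"ret" stands for try_to_unmap_one()'s usual boolean result, and a real change
would also have to consider VMAs that legitimately map only part of the folio:

        bool first = true;

        while (page_vma_mapped_walk(&pvmw)) {
                /*
                 * Hypothetical bail-out: if the first PTE we take the
                 * PTL on does not map the folio's first page, earlier
                 * PTEs were transiently non-present and got skipped,
                 * so give up on this folio to keep its PTEs unified.
                 */
                if (first && folio_test_large(folio) && pvmw.pte &&
                    pte_pfn(ptep_get(pvmw.pte)) != folio_pfn(folio)) {
                        page_vma_mapped_walk_done(&pvmw);
                        ret = false;    /* folio stays fully mapped */
                        break;
                }
                first = false;

                /* ... existing unmap logic for this PTE ... */
        }
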
But if the corner case only accounts for 0.1% of all try_to_unmap_one() calls,
no noticeable performance improvement will be seen. I feel it is more like
swatting flies: flies don't kill people, but they can be quite annoying :-)

I ran into another bug in the large-folio swap-in series because of this
problem: 16 PTEs held contiguous swap entries from Ryan's add_to_swap(), but
they were split in MADV_PAGEOUT because the folio was not completely mapped
after try_to_unmap_one(). Then, after some time, some of the 16 pages were in
the swapcache while others were not. In that case I couldn't swap them in
together and had to handle the page faults one by one, but I was incorrectly
handling them by trying to swap the pages in together with a single read from
the swapfile. If they were handled atomically in try_to_unmap_one(), we could
avoid this. We may have to cope with this kind of problem from time to time
in future work.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
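
P.S. To make the swap-in hazard above concrete: before a batched swap-in reads
several pages from the swapfile in one go, the PTEs have to be re-checked under
the PTL. The helper below is purely hypothetical (its name and placement are
made up for illustration, and a separate swapcache lookup would still be needed
on top of it); it only shows the kind of guard that partially-unmapped large
folios force on the swap-in path:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/*
 * Hypothetical helper: return true only if the nr PTEs starting at
 * @pte still hold consecutive swap entries on the same swap device,
 * beginning at @first. Any present or cleared PTE (e.g. because the
 * folio was only partially unmapped, or a page has already been
 * faulted back in) makes a batched swap-in unsafe.
 */
static bool swap_pte_range_still_contiguous(pte_t *pte, swp_entry_t first,
                                            int nr)
{
        int i;

        for (i = 0; i < nr; i++) {
                pte_t ptent = ptep_get(pte + i);
                swp_entry_t entry;

                if (pte_none(ptent) || pte_present(ptent))
                        return false;

                entry = pte_to_swp_entry(ptent);
                if (non_swap_entry(entry))
                        return false;

                if (swp_type(entry) != swp_type(first) ||
                    swp_offset(entry) != swp_offset(first) + i)
                        return false;
        }

        return true;
}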