From: Barry Song <21cnbao@gmail.com>
Date: Tue, 5 Mar 2024 09:42:59 +1300
Subject: Re: [RFC PATCH] mm: hold PTL from the first PTE while reclaiming a large folio
To: David Hildenbrand
Cc: Ryan Roberts, akpm@linux-foundation.org, linux-mm@kvack.org, chrisl@kernel.org, yuzhao@google.com, hanchuanhua@oppo.com, linux-kernel@vger.kernel.org, willy@infradead.org, ying.huang@intel.com, xiang@kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, Barry Song, Hugh Dickins
In-Reply-To: <17b4527c-3782-4eab-8b33-e0c6ff57139f@redhat.com>
References: <20240304103757.235352-1-21cnbao@gmail.com> <706b7129-85f6-4470-9fd9-f955a8e6bd7c@arm.com> <37f1e6da-412b-4bb4-88b7-4c49f21f5fe9@redhat.com> <10f9542e-f3d8-42b0-9de4-9867cab997b9@arm.com> <17b4527c-3782-4eab-8b33-e0c6ff57139f@redhat.com>

On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand wrote:
>
> On 04.03.24 14:03, Ryan Roberts wrote:
> > On 04/03/2024 12:41, David Hildenbrand wrote:
> >> On 04.03.24 13:20, Ryan Roberts wrote:
> >>> Hi Barry,
> >>>
> >>> On 04/03/2024 10:37, Barry Song wrote:
> >>>> From: Barry Song
> >>>>
> >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>> in try_to_unmap_one().
> >>>
> >>> I just want to check my understanding here - I think the problem occurs for
> >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>> that I've had a look at the code and have a better understanding, I think that
> >>> must be the case? And therefore this problem exists independently of my work to
> >>> support swap-out of mTHP? (From your previous report I was under the impression
> >>> that it only affected mTHP).
> >>>
> >>> Its just that the problem is becoming more pronounced because with mTHP,
> >>> PTE-mapped large folios are much more common?
> >>
> >> That is my understanding.
> >>
> >>>
> >>>> For example, for an anon folio, after try_to_unmap_one(), we may
> >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >>>> So folio will be still mapped, the folio fails to be reclaimed.
> >>>> What’s even more worrying is, its PTEs are no longer in a unified
> >>>> state. This might lead to accident folio_split() afterwards. And
> >>>> since a part of PTEs are now swap entries, accessing them will
> >>>> incur page fault - do_swap_page.
> >>>> It creates both anxiety and more expense. While we can't avoid
> >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >>>> a large folio, we can indeed keep away from kernel's breaking up
> >>>> them due to its code design.
> >>>> This patch is holding PTL from PTE0, thus, the folio will either
> >>>> be entirely reclaimed or entirely kept. On the other hand, this
> >>>> approach doesn't increase PTL contention. Even w/o the patch,
> >>>> page_vma_mapped_walk() will always get PTL after it sometimes
> >>>> skips one or two PTEs because intermediate break-before-makes
> >>>> are short, according to test. Of course, even w/o this patch,
> >>>> the vast majority of try_to_unmap_one still can get PTL from
> >>>> PTE0. This patch makes the number 100%.
> >>>> The other option is that we can give up in try_to_unmap_one
> >>>> once we find PTE0 is not the first entry we get PTL, we call
> >>>> page_vma_mapped_walk_done() to end the iteration at this case.
> >>>> This will keep the unified PTEs while the folio isn't reclaimed.
> >>>> The result is quite similar with small folios with one PTE -
> >>>> either entirely reclaimed or entirely kept.
> >>>> Reclaiming large folios by holding PTL from PTE0 seems a better
> >>>> option comparing to giving up after detecting PTL begins from
> >>>> non-PTE0.
> >>>>
> >>
> >> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >> we can drop some of the details,
> >>
> >> If you need some inspiration, I can give it a shot.
> >>
> >>>> Cc: Hugh Dickins
> >>>> Signed-off-by: Barry Song
> >>>
> >>> Do we need a Fixes tag?
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > Isn't the real problem today that we can end up writing a THP to the swap file
> > (so 2M more IO and space used) but we can't remove it from memory, so no actual
> > reclaim happens? Although I guess your (2) is really just another way of saying
> > that.
>
> The same could happen with small folios I believe? We might end up
> running into the
>
> folio_mapped()
>
> after the try_to_unmap().
>
> Note that the actual I/O does not happen during add_to_swap(), but
> during the pageout() call when we find the folio to be dirty.
>
> So there would not actually be more I/O. Only swap space would be
> reserved, that would be used later when not running into the race.

I am not worried about small folios at all, as they have only one PTE,
so the PTE is either completely unmapped or completely mapped.

In terms of large folios, it is a different story.
For example, for a large folio with 16 PTEs mapped with CONT-PTE, we will have:

1. An unfolded CONT-PTE mapping, e.g. PTE0 present and PTE1-PTE15 swap entries.
2. Page faults on PTE1-PTE15 after try_to_unmap() if we access them. These
   page faults are totally useless and can be avoided if we try_to_unmap()
   properly at the beginning.
3. A potential need to split the large folio afterwards. For example,
   MADV_PAGEOUT and MADV_FREE might split it after finding it is not
   completely mapped.

For small folios, we don't have any of the above concerns.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
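
To illustrate item 1 above, here is a minimal userspace sketch of how a walker that only takes the lock at the first present PTE can leave a 16-PTE folio half unmapped, while a walker that takes the lock before looking at PTE0 cannot. This is a toy model under stated assumptions, not kernel code: `enum pte_state`, `unmap_folio_model()` and its parameters are hypothetical names, the lock is a plain flag, and the concurrent break-before-make is simulated by clearing one entry up front and restoring it at the moment the lock is "acquired" (modelling the walker waiting for the other PTL holder to finish).

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PTES 16

/* Hypothetical, simplified PTE states; real PTEs carry far more information. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/*
 * Toy model of unmapping one large folio while another thread is in the
 * middle of break-before-make (BBM) on entry bbm_idx.
 *
 * lock_from_first == false: skip entries locklessly until a present one
 *                           is seen, and only then take the lock.
 * lock_from_first == true:  take the lock before inspecting entry 0.
 *
 * Returns the number of entries that are still present afterwards.
 */
static int unmap_folio_model(enum pte_state *pte, int nr, int bbm_idx,
                             bool lock_from_first)
{
    bool locked = false;
    int still_present = 0;

    assert(bbm_idx >= 0 && bbm_idx < nr);

    /* The other thread has cleared its entry and not yet rewritten it. */
    pte[bbm_idx] = PTE_NONE;

    for (int i = 0; i < nr; i++) {
        if (!locked && (lock_from_first || pte[i] == PTE_PRESENT)) {
            /*
             * Taking the lock waits for the BBM thread, which holds it;
             * by the time we own the lock, its entry is present again.
             */
            locked = true;
            pte[bbm_idx] = PTE_PRESENT;
        }
        /* Entries are only converted to swap entries under the lock. */
        if (locked && pte[i] == PTE_PRESENT)
            pte[i] = PTE_SWAP;
    }

    for (int i = 0; i < nr; i++)
        if (pte[i] == PTE_PRESENT)
            still_present++;

    return still_present;
}
```

In this model, starting from 16 present entries with the break-before-make on entry 0, the lockless-start variant leaves entry 0 present and entries 1-15 as swap entries, so the folio stays mapped. The lock-from-first variant converts all 16 entries, so the folio is either entirely unmapped or, if the lock cannot be taken, entirely kept.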