Date: Wed, 3 Aug 2022 12:45:30 -0400
From: Peter Xu
To: Nadav Amit
Cc: LKML, linux-mm@kvack.org, Andrea Arcangeli, Andi Kleen, Andrew Morton, David Hildenbrand, Hugh Dickins, Huang Ying, "Kirill A . Shutemov", Vlastimil Babka
Subject: Re: [PATCH 2/2] mm: Remember young bit for page migrations
References: <20220803012159.36551-1-peterx@redhat.com> <20220803012159.36551-3-peterx@redhat.com>

On Wed, Aug 03, 2022 at 12:42:54AM -0700, Nadav Amit wrote:
> On Aug 2, 2022, at 6:21 PM, Peter Xu wrote:
>
> > When page migration happens, we always ignore the young bit settings in the
> > old pgtable, and marking the page as old in the new page table using either
> > pte_mkold() or pmd_mkold().
> >
> > That's fine from functional-wise, but that's not friendly to page reclaim
> > because the moving page can be actively accessed within the procedure. Not
> > to mention hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> >
> > Actually we can easily remember the young bit configuration and recover the
> > information after the page is migrated. To achieve it, define a new bit in
> > the migration swap offset field to show whether the old pte has young bit
> > set or not. Then when removing/recovering the migration entry, we can
> > recover the young bit even if the page changed.
> >
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> > bit. Otherwise the young bit is dropped like before.
>
> I gave it some more thought and I am less confident whether this is the best
> solution.
> Not sure it is not either, so I am raising an alternative with
> pros and cons.
>
> An alternative would be to propagate the access bit into the page (i.e.,
> using folio_set_young()) and then set it back into the PTE later (i.e.,
> based on folio_test_young()). It might even seem that in general it is
> better to always set the page access bit if folio_test_young().

That's indeed an option. It's just that the Young bit (along with the Idle
bit) is only defined when the PAGE_IDLE feature is enabled; otherwise they're
all no-ops.

Another thing is that even though using page flags looks clean, it loses the
per-virtual-address granularity when the page is mapped in multiple mm/vmas.
I don't think that makes a major difference for page reclaim, since reclaim
collects either pte young or page young (as long as page idle is defined), so
the result is the same. But there will be other side effects wherever the
virtual address space matters. E.g., extra TLB flushes are needed, as you
said, even if the page was never accessed through some mapping at all, so
those become false positive "young" pages after the migration entries are
recovered.

In short, it'll be slightly less accurate than storing it in pgtables.

>
> This can be simpler and more performant. Setting the access-bit would not
> impact reclaim decisions (as the page is already considered young), would
> not induce overheads on clearing the access-bit (no TLB flush is needed at
> least on x86), and would save the time the CPU takes to set the access bit
> if the page is ever accessed (on x86).

Agreed. These benefits should be shared by both the pgtable approach and the
PageYoung approach you mentioned.

>
> It may also improve the preciseness of page-idle mechanisms and the
> interaction with it.

IIUC this may need extra work for page idle. Currently when doing rmap walks
we either only look at migration entries or never look at them, according to
the PVMW_MIGRATION flag.
We'll need to teach the page idle code and the rmap walker to walk both
present ptes and migration entries at the same time to achieve that
preciseness.

> IIUC, page-idle does not consider migration entries, so
> the user would not get indication that pages under migration are not idle.
> When page-idle is reset, migrated pages might be later reinstated as
> “accessed”, giving wrong indication that the pages are not-idle, when in
> fact they are.

Before this patchset, we always lose that young bit, hence it's a false
negative after the migration entries are recovered. After this patchset,
we'll have a possible race iff page idle is triggered while some pages are
being migrated, but it's a race condition only and the young bit will be
correctly collected after the migration completes. So I agree it's a problem
as you mentioned, but it's probably still better with the current patch than
without it.

If ultimately we decide to go with the page flags approach as you proposed,
just to mention that setting PageYoung is not enough - that bit is only used
to preserve the page reclaim logic so that it is not disturbed by page idle.
IMO what we'll really need is to clear PageIdle instead.

>
> On the negative side, I am not sure whether other archs, that might require
> a TLB flush for resetting the access-bit, and the overhead of doing atomic
> operation to clear the access-bit, would not induce more overhead than they
> would save.

I think your proposal makes sense and looks clean, maybe even cleaner than
the new max_swapfile_size() approach (and definitely nicer than the old one
of mine). It's just that I still want this to happen even without page idle
enabled - at least Fedora doesn't have page idle enabled by default. I'm not
sure whether it'll be worth it to define the Young bit just for this (note:
iiuc we don't need the Idle bit in this case, but only the Young bit).
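
For reference, the max_swapfile_size() approach boils down to something like
the userspace sketch below. It is only an illustration, not the kernel code:
struct swp_layout, the mig_offset_*() helpers, and the choice of placing the
young bit right above the PFN are all assumptions made up for this example.
The point it shows is that the young bit is only stored when the swap offset
has room beyond the PFN, and is silently dropped otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the swap offset field (not a kernel type). */
struct swp_layout {
	unsigned int offset_bits;	/* usable bits in the swap offset */
	unsigned int pfn_bits;		/* bits needed for the largest PFN */
};

/* The young bit can only be stored if one bit is left above the PFN. */
bool young_bit_fits(const struct swp_layout *l)
{
	return l->offset_bits > l->pfn_bits;
}

/* Build a migration entry offset; drop the young bit if it does not fit. */
uint64_t mig_offset_make(const struct swp_layout *l, uint64_t pfn, bool young)
{
	uint64_t off = pfn;

	/* The PFN itself must fit into the bits reserved for it. */
	assert(l->pfn_bits < 64 && (pfn >> l->pfn_bits) == 0);

	if (young && young_bit_fits(l))
		off |= 1ULL << l->pfn_bits;	/* assumed bit position */
	return off;
}

/* Recover the PFN, masking off anything stored above it. */
uint64_t mig_offset_pfn(const struct swp_layout *l, uint64_t off)
{
	return off & ((1ULL << l->pfn_bits) - 1);
}

/* Recover the young bit; always false when the layout has no room for it. */
bool mig_offset_young(const struct swp_layout *l, uint64_t off)
{
	return young_bit_fits(l) && ((off >> l->pfn_bits) & 1);
}
```

With a layout that has spare offset bits, a young pte survives the round trip
through the migration entry; with a layout where the PFN uses every offset
bit, the entry carries only the PFN and the page comes back old, like before
this patchset.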

The other thing is whether there's any other side effect of losing the
pte-level granularity of the young bit, since once we merge the bits into the
page flags that granularity is lost. So far I don't worry a lot about the tlb
flush overhead, but hopefully there's nothing else we missed.

>
> One more unrelated point - note that remove_migration_pte() would always set
> a clean PTE even when the old one was dirty…

Correct. To put it another way, at least the performance of initial writes
will still suffer after migration on x86.

The dirty bit is kind of different in this case, so I didn't yet try to cover
it. E.g., we don't lose it even without this patchset: it is already
consolidated into PageDirty, or it would be a bug.

I think PageDirty could be cleared during the migration procedure; if so, we
could wrongly apply the dirty bit to the recovered pte. I thought about this
before posting this series, but I hesitated to add the dirty bit together
with it, at least in these initial versions, since the dirty bit may need
some more justification.

Please feel free to share any further thoughts on the dirty bit.

Thanks,

-- 
Peter Xu