Date: Wed, 1 Feb 2023 11:22:13 -0500
From: Peter Xu
To: James Houghton
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
 Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
 Naoya Horiguchi, "Dr. David Alan Gilbert", "Matthew Wilcox (Oracle)",
 Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> On Tue, Jan 31, 2023 at 5:24 PM Peter Xu wrote:
> >
> > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu wrote:
> > > > > >
> > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> [snip]
> > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > freeing pgtable of the PUD entry).
> > > > >
> > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > the direction Matthew Wilcox wants to go with THPs.
> > > >
> > > > I may not be familiar with it, do you mean this one?
> > > >
> > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > >
> > > Yep that's it.
> > >
> > > >
> > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > information and the folio size being static to hpage size.
> > > >
> > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > it needs to be managed by batching the ptes when install/zap.
> > >
> > > Agreed. It's probably easier for HugeTLB because they're always
> > > "naturally aligned" and yeah they can't change sizes.
> > >
> > > >
> > > > >
> > > > > Something I noticed though, from the implementation of
> > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > folio_referenced() is never called for hugetlb folios.
> > > >
> > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > should match with what it expects to see later, more or less.
> > >
> > > I don't fully understand what you mean here.
> >
> > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > leaves for the folio, big or small. I think that will match the number
> > with what got returned from folio_mapcount().
>
> See below.
>
> >
> > >
> > > >
> > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > like what you raised in the other thread on page migration. Meanwhile, I
> > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > before taking the page lock later in folio_referenced().
> > > >
> > > > It's just that I don't see any severe issue either due to any of above, as
> > > > long as that information is only used as a hint for next steps, e.g., to
> > > > swap which page out.
> > >
> > > I also don't see a big problem with folio_referenced() (and you're
> > > right that folio_mapcount() can be stale by the time it takes the
> > > folio lock). It still seems like folio_mapcount() should return the
> > > total number of PTEs that map the page though. Are you saying that
> > > breaking this would be ok?
> >
> > I didn't quite follow - isn't that already doing so?
> >
> > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > all or part of the folio being mapped.
>
> We've talked about 3 ways of handling mapcount:
>
> 1. The RFC v2 way, which is head-only, and we increment the compound
> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> 2. The THP-like way. If we are fully mapping the hugetlb page with the
> hstate-level PTE, we increment the compound mapcount, otherwise we
> increment subpage->_mapcount.
> 3. The RFC v1 way (the way you have suggested above), which is
> head-only, and we increment the compound mapcount if the hstate-level
> PTE is made present.

Oh, that's where it comes from! It has taken quite some months going through
all of these, and I can hardly remember the details.

>
> With #1 and #2, there is no concern with folio_mapcount(). But with
> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> would yield 1 instead of 512 (right?). That's what I mean.
>
> #1 has problems wrt smaps and migration (though there were other
> problems with those anyway that Mike has fixed), and #2 makes
> MADV_COLLAPSE slow to the point of being unusable for some
> applications.

Ah, so you're talking about after HGM is applied..
while I was only talking about THPs.

If we apply the logic here with idea 3), the worst case is that we'll need
special care for HGM hugetlb in folio_referenced_one(), so the default
page_vma_mapped_walk() may not apply anymore - the resource is always
hstate-sized, so counting small ptes does not help either - we can just walk
until the hstate entry and do referenced++ if it's not none, at the entrance
of folio_referenced_one().

But I'm not sure whether that'll be necessary at all, as I'm not sure whether
that path can be triggered in any form (from the top it should always be
shrink_page_list()). In that sense, maybe we can also consider adding a
WARN_ON_ONCE() in folio_referenced() for when a hugetlb page gets passed in?
Meanwhile, we could add a TODO comment explaining that the current walk won't
easily work for HGM, so when it becomes applicable to hugetlb we need to
rework it?

I confess that's not pretty, though. But that'll leave 3) with no major
defect functionality-wise.

Side note: did we finish folio conversion on hugetlb at all? I think at
least we need some helper like folio_test_huge(). It seems still missing.
Maybe it's another clue that hugetlb is not important to folio_referenced()
because it's already fully converted?

>
> It seems like the least bad option is #1, but maybe we should have a
> face-to-face discussion about it? I'm still collecting some more
> performance numbers.

Let's see how it goes..

Thanks,

--
Peter Xu
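
For reference, the three mapcount schemes quoted above (#1, #2, #3) can be
compared with a small userspace model. This is only a sketch under stated
assumptions, not kernel code: every name below (mapcount_scheme,
mapcount_model, model_single_vma, model_folio_mapcount) is hypothetical, a
single VMA maps a single hugetlb page, the -1 bias is ignored, and under #2
each partial mapping is assumed to cover exactly one base page.

```c
#include <assert.h>

/* Hypothetical labels for the three schemes discussed in the thread. */
enum mapcount_scheme {
    SCHEME_PER_PT_MAPPING = 1, /* #1: RFC v2, head-only, one count per PT mapping */
    SCHEME_THP_LIKE = 2,       /* #2: compound count if fully mapped, else subpage */
    SCHEME_HSTATE_ENTRY = 3    /* #3: RFC v1, head-only, one count per hstate entry */
};

/* Userspace stand-in for the counters; the -1 bias is ignored. */
struct mapcount_model {
    unsigned long compound_mapcount;    /* count kept on the head page */
    unsigned long subpage_mapcount_sum; /* sum of subpage->_mapcount */
};

/*
 * Model one VMA mapping one hugetlb page of hpage_size bytes with page
 * table entries that each cover map_size bytes.
 */
static struct mapcount_model model_single_vma(enum mapcount_scheme scheme,
                                              unsigned long hpage_size,
                                              unsigned long map_size)
{
    struct mapcount_model m = {0, 0};
    unsigned long nr_mappings;

    assert(map_size > 0 && map_size <= hpage_size);
    assert(hpage_size % map_size == 0);
    nr_mappings = hpage_size / map_size;

    switch (scheme) {
    case SCHEME_PER_PT_MAPPING:
        /* Every page table mapping bumps the compound mapcount. */
        m.compound_mapcount = nr_mappings;
        break;
    case SCHEME_THP_LIKE:
        if (map_size == hpage_size)
            m.compound_mapcount = 1;
        else
            /* Assumption: each entry maps exactly one base page. */
            m.subpage_mapcount_sum = nr_mappings;
        break;
    case SCHEME_HSTATE_ENTRY:
        /* Counted once, when the hstate-level entry becomes present. */
        m.compound_mapcount = 1;
        break;
    }
    return m;
}

/* Total that a folio_mapcount()-style query would report in this model. */
static unsigned long model_folio_mapcount(struct mapcount_model m)
{
    return m.compound_mapcount + m.subpage_mapcount_sum;
}
```

Calling model_single_vma() with a 2M page (2UL << 20) and 4096-byte mappings
reproduces the numbers in the quoted text: the total is 512 under #1 and #2,
and 1 under #3.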