Date: Thu, 4 Jul 2024 12:43:53 -0400
From: Peter Xu
To: David Hildenbrand
Cc: Oscar Salvador, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy, Jason Gunthorpe
Subject: Re: [PATCH 00/45] hugetlb pagewalk unification
References: <20240704043132.28501-1-osalvador@suse.de> <617169bc-e18c-40fa-be3a-99c118a6d7fe@redhat.com> <84d4e799-90da-487e-adba-6174096283b5@redhat.com>
In-Reply-To: <84d4e799-90da-487e-adba-6174096283b5@redhat.com>

On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> On 04.07.24 16:30, Peter Xu wrote:
> > Hey, David,
> >
>
> Hi!
>
> > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > > There are roughly two categories of page table walkers we have:
> > >
> > > 1) We actually only want to walk present folios (to be precise, page
> > >    ranges of folios).
> > >    We should look into moving away from the page walker API where
> > >    possible, and have something better that directly gives us the
> > >    folio (page ranges). Any PTE batching would be done internally.
> > >
> > > 2) We want to deal with non-present folios as well (swp entries and all
> > >    kinds of other stuff). We should maybe implement our custom page
> > >    table walker and move away from walk_page_range(). We are not walking
> > >    "pages" after all but everything else included :)
> > >
> > > Then, there is a subset of 1) where we only want to walk to a single address
> > > (a single folio). I'm working on that right now to get rid of follow_page()
> > > and some (IIRC 3: KSM and DAMON) walk_page_range() users. Hugetlb will still
> > > remain a bit special, but I'm afraid we cannot hide that completely.
> >
> > Maybe you are talking about the generic concept of "page table walker", not
> > walk_page_range() explicitly?
> >
> > I'd agree if it's about the generic concept. For example, follow_page()
> > definitely is tailored for getting the page/folio. But just to mention
> > Oscar's series is only working on the page_walk API itself. What I see so
> > far is most of the walk_page API users aren't described above - most of
> > them do not fall into category 1) at all, if any. And they either need to
> > fetch something from the pgtable where having the folio isn't enough, or
> > modify the pgtable for different reasons.
>
> Right, but having 1) does not imply that we won't be having access to the
> page table entry in an abstracted form, the folio is simply the primary
> source of information that these users care about. 2) is an extension of 1),
> but walking+exposing all (or most) other page table entries as well in some
> form, which is certainly harder to get right.
>
> Taking a look at some examples:
>
> * madvise_cold_or_pageout_pte_range() only cares about present folios.
> * madvise_free_pte_range() only cares about present folios.
> * break_ksm_ops() only cares about present folios.
> * mlock_walk_ops() only cares about present folios.
> * damon_mkold_ops() only cares about present folios.
> * damon_young_ops() only cares about present folios.
>
> There are certainly other page_walk API users that are more involved and
> need to do way more magic, which fall into category 2). In particular things
> like swapin_walk_ops(), hmm_walk_ops() and most of fs/proc/task_mmu.c. Likely
> there are plenty of them.
>
>
> Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there
> even is left in using walk_page_range() :)

Hmm, I must confess that from a quick look I don't yet see why the current
page_walk API won't work under p4d there; it could be that I missed some
details.

>
> >
> > A generic pgtable walker still looks wanted at some point, but it can be
> > too involved to be introduced together with this "remove hugetlb_entry"
> > effort.
>
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
>
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Oh, so you meant you have a plan to rewrite some of the page_walk API users
to use the new API you plan to propose?

That looks fine by me.  I assume anything new will already take hugetlb
folios into account, so it'll "just work" and actually reduce the number of
patches here, am I right?

If it still needs time to land, I think it's also fine if it's done on
top of Oscar's.  So it may boil down to the schedule in that case, and we
may also want to know how Oscar sees this.
>
> >
> > To me, that future work is not yet about "get the folio, ignore the
> > pgtable", but about how to abstract different layers of pgtables, so the
> > caller may get a generic concept of "one pgtable entry" with the level/size
> > information attached, and process it at a single place / hook, and perhaps
> > hopefully even work with a device pgtable, as long as it's a radix tree.
>
> To me 2) is an extension of 1). My thinking is that we can start with 1)
> without having to care about all details of 2). If we have to make it so
> generic that we can walk any page table layout out there in this world, I'm
> not so sure.

I still see some hope there; after all, the radix pgtable is indeed a common
abstraction, and it looks to me like a lot of things share that structure.
IIUC one challenge of it is being fast.  So.. I don't know.  But I'll be more
than happy to see it come if someone can work it out, and it also sounds
very nice if some chunk of code can be run the same for mm/, kvm/ and
iommu/.

Thanks,

--
Peter Xu
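
For illustration only, the "one pgtable entry with the level/size
information attached, processed at a single hook" idea discussed above
could look roughly like the userspace C sketch below.  It is not kernel
code and not a proposed API: every toy_* name, the 3-level / 4-slot
layout and the callback signature are assumptions made up for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy radix table: 3 levels, 4 slots per level, 4 KiB base pages.
 * All names are hypothetical; none of this is kernel API. */
#define TOY_LEVELS      3
#define TOY_SLOTS       4
#define TOY_PAGE_SHIFT  12
#define TOY_SLOT_BITS   2

enum toy_kind { TOY_NONE, TOY_LEAF, TOY_TABLE };

struct toy_table;

struct toy_slot {
	enum toy_kind kind;
	struct toy_table *next;		/* valid when kind == TOY_TABLE */
};

struct toy_table {
	struct toy_slot slot[TOY_SLOTS];
};

/* What the single hook sees: one entry plus its level and size. */
struct toy_entry {
	uint64_t addr;
	uint64_t size;
	int level;			/* 0 = lowest level (base page) */
};

typedef void (*toy_entry_fn)(const struct toy_entry *e, void *priv);

/* Bytes covered by one slot at the given level. */
static uint64_t toy_level_size(int level)
{
	return (uint64_t)1 << (TOY_PAGE_SHIFT + level * TOY_SLOT_BITS);
}

/* Walk the tree; call fn once per leaf, whatever level it sits at. */
static void toy_walk(const struct toy_table *t, int level, uint64_t base,
		     toy_entry_fn fn, void *priv)
{
	assert(t && fn && level >= 0 && level < TOY_LEVELS);

	for (int i = 0; i < TOY_SLOTS; i++) {
		const struct toy_slot *s = &t->slot[i];
		uint64_t addr = base + (uint64_t)i * toy_level_size(level);

		if (s->kind == TOY_LEAF) {
			struct toy_entry e = { addr, toy_level_size(level), level };

			fn(&e, priv);
		} else if (s->kind == TOY_TABLE && level > 0) {
			toy_walk(s->next, level - 1, addr, fn, priv);
		}
	}
}

/* Example hook: sum the mapped bytes without caring about entry size. */
static void toy_count_bytes(const struct toy_entry *e, void *priv)
{
	*(uint64_t *)priv += e->size;
}

/* Build a small tree with one "huge" leaf and two base-page leaves. */
static uint64_t toy_demo(void)
{
	struct toy_table top = { { { TOY_NONE, NULL } } };
	struct toy_table mid = { { { TOY_NONE, NULL } } };
	struct toy_table low = { { { TOY_NONE, NULL } } };
	uint64_t bytes = 0;

	top.slot[0].kind = TOY_TABLE;
	top.slot[0].next = &mid;
	mid.slot[0].kind = TOY_LEAF;	/* 16 KiB leaf at level 1 */
	mid.slot[1].kind = TOY_TABLE;
	mid.slot[1].next = &low;
	low.slot[0].kind = TOY_LEAF;	/* 4 KiB leaf at level 0 */
	low.slot[2].kind = TOY_LEAF;	/* 4 KiB leaf at level 0 */

	toy_walk(&top, TOY_LEVELS - 1, 0, toy_count_bytes, &bytes);
	return bytes;
}
```

Here toy_demo() comes to 24576 bytes (one 16 KiB leaf plus two 4 KiB
leaves), and the hook never needs to know which level an entry came
from.  Whether a walker shaped like this can be made fast enough is
exactly the open question raised above.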