linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Claudio Imbrenda <imbrenda@linux.ibm.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Xu <peterx@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Huang Ying <ying.huang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>,
	Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Russell King <linux@armlinux.org.uk>,
	"David S. Miller" <davem@davemloft.net>,
	Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Jann Horn <jannh@google.com>,
	Vishal Moola <vishal.moola@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
Date: Mon, 3 Jul 2023 13:00:13 +0200	[thread overview]
Message-ID: <20230703130013.559217c9@p-imbrenda> (raw)
In-Reply-To: <7f6d399b-c47-1faa-f7f6-9932b9811f8c@google.com>

On Fri, 30 Jun 2023 12:22:43 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

[...]

> That's something I would have expected to be handled already via
> mmu_notifiers, rather than buried inside the page table freeing.
> 
> If s390 is the only architecture to go that way, and could instead do
> it via mmu_notifiers, then I think that will be more easily supported
> in the long term.

I am very well aware of that, and in fact I've been working on
exactly that for some time already. But it's a very complex minefield
and therefore I'm proceeding *very* carefully. It will take quite some
time before anything comes out.

> 
> But I'm writing from a position of very great ignorance: advising
> KVM on s390 is many dimensions away from what I'm capable of.

fair enough, but in this case it doesn't mean that you are not right :)

> 
> > 
> > the point here is: we need that only for page_table_free_rcu(); all
> > other users of page_table_free() cannot act on guest page tables  
> 
> I might be wrong, but I think that most users of page_table_free()
> are merely freeing a page table which had to be allocated up front,
> but was then found unnecessary (maybe a racing task already inserted
> one): page tables which were never exposed to actual use.

that was my impression as well

> > (because we don't allow THP for KVM guests). and that is why
> > page_table_free() does not do gmap_unlink() currently.  
> 
> But THP collapse does (or did before this series) use it to free a
> page table which had been exposed to use.  The fact that s390 does

that is also my understanding

> not allow THP for KVM guests makes page_table_free(), and this new
> pte_free_defer(), safe for that; but it feels dangerously coincidental.

not really; my guess is that we _intentionally_ did not do anything
there, because we knew we did not need it, knowing well that
we would need it once we would want to support THP for guests.
so not a coincidence, but a conscious decision based, I guess, on
touching as little code as needed.

> 
> It's easy to imagine a future change being made, which would stumble
> over this issue.  I have imagined that pte_free_defer() will be useful
> in future, in the freeing of empty page tables: but s390 may pose a
> problem there - though perhaps no more of a problem than additionally
> needing to pass a virtual address down the stack.

yeah it can always be fixed later if we need to

> 
> >   
> > >   
> > > > 
> > > > or will it be used instead of page_table_free?    
> > > 
> > > Not always; but yes, this case of removing a page table used
> > > page_table_free() before; but now, with the lighter locking, needs
> > > to keep the page table valid until the RCU grace period expires.  
> > 
> > so if I understand correctly your code will, sometimes, under some
> > circumstances, replace what page_table_free() does, but it will never
> > replace page_table_free_rcu()?
> > 
> > because in that case there would be no issues   
> 
> Yes, thanks for confirming: we have no issue here at present, but may
> do if use of pte_free_defer() is extended to other contexts in future.
> 
> Would it be appropriate to add a WARN_ON_ONCE around that
> > > > > +	if (mm_alloc_pgste(mm)) {  
> in pte_free_defer()?

that's actually not a bad idea. should never happen, but... that's the
whole point of a WARN_ON after all

> 
> I ask that somewhat rhetorically: that block disappears in the later
> version I was working on last night (and will return to shortly), in
> which pte_free_defer() just sets a bit and calls page_table_free().
> 
> But I'd like to understand the possibilities better: does mm_alloc_pgste()
> correspond 1:1 to KVM guest on s390, or does it cover several different
> possibilities of which KVM guest is one, or am I just confused to be
> thinking there's any relationship?

this is... historically complicated (because of course it is)

in theory any process can allocate PGSTEs by having the right bits in
the ELF header (that's how QEMU does it currently). And QEMU will have
PGSTEs allocated even if it does not actually start any guests.

Then we have the vm.allocate_pgste sysctl knob; once enabled, it will
cause all processes to have PGSTEs allocated. This is how we handled
PGSTEs before we switched to ELF header bits.

So in summary: in __practice__ yes, only QEMU will have PGSTEs. But in
theory anything is possible and allowed.

> 
> Thanks,
> Hugh
> 
> >   
> > >   
> > > > 
> > > > this is actually quite important for KVM on s390    
> > > 
> > > None of us are wanting to break KVM on s390: your guidance appreciated!
> > > 
> > > Thanks,
> > > Hugh  



  reply	other threads:[~2023-07-03 11:01 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-20  7:35 [PATCH v2 00/12] mm: free retracted page table by RCU Hugh Dickins
2023-06-20  7:40 ` [PATCH v2 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s Hugh Dickins
2023-06-20  7:42 ` [PATCH v2 02/12] mm/pgtable: add PAE safety to __pte_offset_map() Hugh Dickins
2023-06-20  7:43 ` [PATCH v2 03/12] arm: adjust_pte() use pte_offset_map_nolock() Hugh Dickins
2023-06-20  7:45 ` [PATCH v2 04/12] powerpc: assert_pte_locked() " Hugh Dickins
2023-06-20  7:47 ` [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-06-20 11:45   ` Jason Gunthorpe
2023-06-20 19:54     ` Hugh Dickins
2023-06-20 23:52       ` Jason Gunthorpe
2023-06-22  2:36         ` Hugh Dickins
2023-06-27 17:01           ` Jason Gunthorpe
2023-06-27 20:53             ` Hugh Dickins
2023-06-20  7:49 ` [PATCH v2 06/12] sparc: add pte_free_defer() for pte_t *pgtable_t Hugh Dickins
2023-06-20  7:51 ` [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page Hugh Dickins
2023-06-28 19:16   ` Gerald Schaefer
2023-06-29  5:08     ` Hugh Dickins
2023-06-29 15:22       ` Jason Gunthorpe
2023-06-29 15:56         ` Gerald Schaefer
2023-06-30  6:00           ` Hugh Dickins
2023-07-02  4:32             ` Hugh Dickins
2023-07-04 13:40               ` Alexander Gordeev
2023-07-04 16:03                 ` Hugh Dickins
2023-07-04 15:19               ` Gerald Schaefer
2023-07-04 17:03                 ` Hugh Dickins
2023-07-05 12:55                   ` Gerald Schaefer
2023-07-06  1:20                     ` Hugh Dickins
2023-07-06 15:02                       ` Gerald Schaefer
2023-07-06 19:45                         ` Hugh Dickins
2023-07-10 17:21                     ` Jason Gunthorpe
2023-07-05  6:46               ` Alexander Gordeev
2023-07-06  0:52                 ` Hugh Dickins
2023-07-07 14:37                   ` Gerald Schaefer
2023-07-03 16:10             ` Gerald Schaefer
2023-06-29 13:59     ` Alexander Gordeev
2023-06-29 15:43       ` Gerald Schaefer
2023-06-30 13:38   ` Claudio Imbrenda
2023-06-30 15:28     ` Hugh Dickins
2023-06-30 16:25       ` Claudio Imbrenda
2023-06-30 19:22         ` Hugh Dickins
2023-07-03 11:00           ` Claudio Imbrenda [this message]
2023-07-03 21:29             ` Jason Gunthorpe
2023-06-20  7:53 ` [PATCH v2 08/12] mm/pgtable: add pte_free_defer() for pgtable as page Hugh Dickins
2023-06-20  7:54 ` [PATCH v2 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock Hugh Dickins
2023-06-20  7:56 ` [PATCH v2 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() Hugh Dickins
2023-06-20  8:04   ` [PATCH mm " Hugh Dickins
2023-06-20  7:58 ` [PATCH v2 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() Hugh Dickins
2023-06-20  7:59 ` [PATCH v2 12/12] mm: delete mmap_write_trylock() and vma_try_start_write() Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230703130013.559217c9@p-imbrenda \
    --to=imbrenda@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=anshuman.khandual@arm.com \
    --cc=apopple@nvidia.com \
    --cc=axelrasmussen@google.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=davem@davemloft.net \
    --cc=david@redhat.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=hca@linux.ibm.com \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=ira.weiny@intel.com \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lstoakes@gmail.com \
    --cc=mgorman@techsingularity.net \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcampbell@nvidia.com \
    --cc=rppt@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=sj@kernel.org \
    --cc=song@kernel.org \
    --cc=sparclinux@vger.kernel.org \
    --cc=steven.price@arm.com \
    --cc=surenb@google.com \
    --cc=thomas.hellstrom@linux.intel.com \
    --cc=vbabka@suse.cz \
    --cc=vishal.moola@gmail.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    --cc=zackr@vmware.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox