Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

From: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
To: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Xu <peterx@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Huang Ying <ying.huang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Zack Rusin <zackr@vmware.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>,
	Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Russell King <linux@armlinux.org.uk>,
	"David S. Miller" <davem@davemloft.net>,
	Michael Ellerman <mpe@ellerman.id.au>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Jann Horn <jannh@google.com>,
	Vishal Moola <vishal.moola@gmail.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
Date: Wed, 5 Jul 2023 14:55:16 +0200
Message-ID: <20230705145516.7d9d554d@thinkpad-T15>
In-Reply-To: <e678affb-5eee-a055-7af1-1d29a965663b@google.com>

On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > Hugh Dickins <hughd@google.com> wrote:  
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > > > 
> > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > > extent that I've not even tried to verify it; but I think I do get the
> > > > point now, that we need further info than just PPHHAA to know whether
> > > > the page is on the list or not.  But I think that if we move where the
> > > > call_rcu() is done, then the page can stay on or off the list by same
> > > > rules as before (but need to check HH bits along with PP when deciding
> > > > whether to allocate, and whether to list_add_tail() when freeing).    
> > > 
> > > No, not quite the same rules as before: I came to realize that using
> > > list_add_tail() for the HH pages would be liable to put a page on the
> > > list which forever blocked reuse of PP list_add_tail() pages after it
> > > (could be solved by a list_move() somewhere, but we have agreed to
> > > prefer simplicity).
> > > 
> > > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > > an easier way of dealing with those "is it on the list" questions.
> > > I expect that we shall be close to reaching agreement on...  
> > 
> > This looks really nice, almost too good and easy to be true. I did not
> > find any obvious flaw, just some comments below. It also survived LTP
> > without any visible havoc, so I guess this approach is the best so far.  
> 
> Phew! I'm of course glad to hear this: thanks for your efforts on it.
> 
> ...
> > > --- a/arch/s390/mm/pgalloc.c
> > > +++ b/arch/s390/mm/pgalloc.c
> > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > >   * while the PP bits are never used, nor such a page is added to or removed
> > >   * from mm_context_t::pgtable_list.
> > > + *
> > > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > > + * has elapsed (which page_table_free_rcu() does not actually guarantee).  
> > 
> > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > allow reuse before grace period elapsed. And I hope that it does so, by
> > setting the PP bits, which would be noticed in page_table_alloc(), in
> > case the page would be seen there.
> > 
> > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > end of the list, and so they could be seen in page_table_alloc(), but they
> > should not be reused before grace period elapsed and __tlb_remove_table()
> > cleared the PP bits, as far as I understand.
> > 
> > So what exactly do you mean with "which page_table_free_rcu() does not actually
> > guarantee"?  
> 
> I'll answer without locating and re-reading what Jason explained earlier,
> perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> he may have explained it better.  And without working out again all the
> MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> 
> The detail that sticks in my mind is the fallback in tlb_remove_table()

Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
but that is rather a generic issue, and not s390-specific. I thought you
meant some s390-oddity here, of which we have a lot, unfortunately...
Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
I guess you could say that page_table_free_rcu() cannot guarantee what
tlb_remove_table() cannot guarantee.

Maybe change to "which page_table_free_rcu() does not actually guarantee,
by calling tlb_remove_table()", to make it clear that this is not a problem
of page_table_free_rcu() itself.

> in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> batch the tables for freeing by RCU, and resorts instead to an immediate 
> TLB flush (I think: that again involves chasing definitions) followed by
> tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> and is commented: 
> /*
>  * This isn't an RCU grace period and hence the page-tables cannot be
>  * assumed to be actually RCU-freed.
>  *
>  * It is however sufficient for software page-table walkers that rely on
>  * IRQ disabling.
>  */
> 
> Whether that's good for your PP pages or not, I've given no thought:
> I've just taken it on trust that what s390 has working today is good.

Yes, we should be fine with that, current code can be trusted :-)
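
For anyone following along, the fallback described above can be modelled
in user space roughly as below. This is only a sketch under assumptions:
struct model_batch, model_tlb_remove_table() and the enum names are
hypothetical, a NULL "spare" argument stands in for a failed
__get_free_page(GFP_NOWAIT), and the real logic in mm/mmu_gather.c has
more to it (TLB invalidation, the actual IPI and the RCU callback).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_BATCH 4	/* hypothetical; stands in for the real batch size */

/* Hypothetical stand-in for the batch page that collects tables. */
struct model_batch {
	size_t nr;
	void *tables[MODEL_MAX_BATCH];
};

enum model_free_path {
	QUEUED_FOR_RCU,		/* table sits in a batch, freed after a grace period */
	FREED_AFTER_IPI_SYNC,	/* no batch page: IPI sync only, not a grace period */
};

/*
 * Model of the decision only. "spare" is the page the allocator would
 * hand out; passing NULL models the allocation failure.
 */
static enum model_free_path model_tlb_remove_table(struct model_batch **batch,
						   struct model_batch *spare,
						   void *table)
{
	assert(table != NULL);

	if (*batch == NULL) {
		*batch = spare;
		if (*batch == NULL)
			return FREED_AFTER_IPI_SYNC;
		(*batch)->nr = 0;
	}

	(*batch)->tables[(*batch)->nr++] = table;
	/* A full batch is handed over for RCU freeing; start a new one next time. */
	if ((*batch)->nr == MODEL_MAX_BATCH)
		*batch = NULL;
	return QUEUED_FOR_RCU;
}
```

The FREED_AFTER_IPI_SYNC path is the one that makes the "RCU" in
page_table_free_rcu() only conditional: the caller cannot tell which of
the two paths its table took.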

> 
> If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> then I would not have written "(which page_table_free_rcu() does not
> actually guarantee)".  But it cannot use call_rcu() because it does
> not have an rcu_head to work with - it's in some generic code, and
> there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> 
> And Jason would have much preferred us to address the issue from that
> angle; but not only would doing so destroy my sanity, I'd also destroy
> 20 architectures TLB-flushing, unbuilt and untested, in the attempt.

Oh yes, if your changes had allowed us to get rid of that "semi-RCU"
logic, that would really have been a major boost in popularity, I guess.
But it probably is the way it is because it is not so easily fixed...

> 
> ...
> > > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > >  		 */
> > >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > >  		mask >>= 24;
> > > -		if (mask & 0x03U)
> > > +		if ((mask & 0x03U) && !PageActive(page)) {
> > > +			/*
> > > +			 * Other half is allocated, and neither half has had
> > > +			 * its free deferred: add page to head of list, to make
> > > +			 * this freed half available for immediate reuse.
> > > +			 */
> > >  			list_add(&page->lru, &mm->context.pgtable_list);
> > > -		else
> > > -			list_del(&page->lru);
> > > +		} else {
> > > +			/* If page is on list, now remove it. */
> > > +			list_del_init(&page->lru);
> > > +		}  
> > 
> > Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> > other half is still allocated, when called from pte_free_defer() on a
> > fully allocated page, which was not on the list (and with PageActive, and
> > (mask & 0x03U) true).
> > Not sure if adding an additional mask check to the else path would be
> > needed, but it seems that list_del_init() should also be able to handle
> > this.  
> 
> list_del_init() is very cheap in the unnecessary case: the cachelines
> required are already there.  You don't want a flag to say whether to
> call it or not, it is already the efficient approach.

Yes, I also see no functional issue here. Just thought that the extra
write could be avoided, e.g. by checking for list_empty() or mask first.
But I guess that is simply the benefit of list_del_init(), that you
don't have to check, at least if it is guaranteed that rcu_head is
never in use here.
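
To illustrate why no check is needed, here is a minimal user-space model
of the list helpers. The names mirror the kernel's, but these are
simplified re-implementations written for this example, and the sketch
assumes the entry was initialised (pointing at itself) before its first
use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert entry right after head, i.e. at the front of the list. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink entry and leave it pointing at itself. For an entry that is
 * already off-list (self-pointing), the two stores just rewrite its own
 * pointers, so the call is harmless.
 */
static void list_del_init(struct list_head *entry)
{
	assert(entry->next != NULL && entry->prev != NULL);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}
```

This only holds while the storage really is used as a list_head:
page->lru shares a union with rcu_head in struct page, which is why it
matters that rcu_head is never in use at this point.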

Then maybe adjust the comment, because as it stands it makes you wonder:
when you read (and understand) the code, you see that this list_del_init()
might also be called for pages that are not on the list.
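
The branch in the hunk quoted above can be modelled in user space as
follows, to show which cases reach the list_del_init() path. This is a
sketch under assumptions: xor_bits() mimics what I understand
atomic_xor_bits() to return (the new value), free_half() and the enum are
hypothetical names, and PageActive is passed in as a plain bool.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum list_action { LIST_ADD_HEAD, LIST_DEL_INIT };

/* XOR the given bits into *v and return the resulting value. */
static unsigned int xor_bits(atomic_uint *v, unsigned int bits)
{
	return atomic_fetch_xor(v, bits) ^ bits;
}

/*
 * Model of freeing one 2K half (bit 0 or 1): flip its A bit and its
 * P bit in the upper byte, then pick the list operation the same way
 * the patched page_table_free() does.
 */
static enum list_action free_half(atomic_uint *refcount, unsigned int bit,
				  bool page_active)
{
	unsigned int mask;

	assert(bit < 2);
	mask = xor_bits(refcount, 0x11U << (bit + 24));
	mask >>= 24;
	if ((mask & 0x03U) && !page_active)
		return LIST_ADD_HEAD;	/* other half allocated, nothing deferred */
	return LIST_DEL_INIT;		/* may run for a page that is not on the list */
}
```

With both halves allocated and PageActive set, free_half() returns
LIST_DEL_INIT although the fully allocated page was never on the list,
which is the case the comment does not mention.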

> 
> (But you were right not to use it in your pt_frag_refcount version,
> because there we were still trying to do the call_rcu() per fragment
> rather than per page, so page->lru could have been on the RCU queue.)

That is actually the one thing I am still trying to figure out, by drawing
pictures: whether we really, really never end up here in list_del_init()
while rcu_head is in use, e.g. by racing with PageActive.

> 
> > 
> > Same thought applies to the similar logic in page_table_free_rcu()
> > below.
> >   
> > >  		spin_unlock_bh(&mm->context.lock);
> > >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > >  		mask >>= 24;
> > > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > >  	}
> > >  
> > >  	page_table_release_check(page, table, half, mask);
> > > -	pgtable_pte_page_dtor(page);
> > > -	__free_page(page);
> > > +	if (TestClearPageActive(page))
> > > +		call_rcu(&page->rcu_head, pte_free_now);
> > > +	else
> > > +		pte_free_now(&page->rcu_head);  
> > 
> > This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> > worries me a bit, because it is done outside the spin_lock. It "feels" like
> > there could be some race with the PageActive checks inside the spin_lock,
> > but when drawing some pictures, I could not find any such scenario yet.
> > Also, our existing spin_lock is probably not supposed to protect against
> > PageActive changes anyway, right?  
> 
> Here (and similarly in __tlb_remove_table()) is where we are about to free
> the page table page: both of the fragments have already been released,
> there is nobody left who could be racing against us to set PageActive.

Yes, that is what makes this approach so nice, i.e. no more checking
for HH bits or worrying about a double call_rcu(), simply do the freeing
whenever the page is ready. At least in theory, still drawing pictures :-)
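
For the pictures, the final decision can be reduced to a tiny user-space
model like the one below. It is only a sketch: struct model_page,
test_clear_active() and release_page() are hypothetical names, and an
atomic exchange stands in for TestClearPageActive().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one page flag that matters here. */
struct model_page {
	atomic_bool active;
};

enum release_path { RELEASE_VIA_CALL_RCU, RELEASE_IMMEDIATELY };

/* Clear the flag and report whether it had been set. */
static bool test_clear_active(struct model_page *page)
{
	return atomic_exchange(&page->active, false);
}

/*
 * Called once both halves are released: defer the free by RCU only if
 * some earlier pte_free_defer() marked the page active.
 */
static enum release_path release_page(struct model_page *page)
{
	assert(page != NULL);
	return test_clear_active(page) ? RELEASE_VIA_CALL_RCU
				       : RELEASE_IMMEDIATELY;
}
```

Because the flag is cleared by the same operation that reads it, there is
at most one call_rcu() per page, no matter how many halves had their free
deferred.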

But this really looks very good to me, and it also does no worse with
LTP than the other approaches.
