From: Shaohua Li <shli@kernel.org>
To: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hughd@google.com,
mel@csn.ul.ie
Subject: Re: [patch]x86: clearing access bit don't flush tlb
Date: Mon, 31 Mar 2014 10:16:03 +0800 [thread overview]
Message-ID: <20140331021603.GA24892@kernel.org> (raw)
In-Reply-To: <53381506.1090104@redhat.com>
On Sun, Mar 30, 2014 at 08:58:46AM -0400, Rik van Riel wrote:
> On 03/28/2014 03:02 PM, Shaohua Li wrote:
> > On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote:
> >> On 03/27/2014 01:12 PM, Shaohua Li wrote:
> >>> On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote:
> >>>> On 03/26/2014 06:30 PM, Shaohua Li wrote:
> >>>>>
> >>>>> I posted this patch a year ago or so, but it got lost. Reposting it here to
> >>>>> see if we can make progress this time.
> >>>>
> >>>> I believe we can make progress. However, I also
> >>>> believe the code could be enhanced to address a
> >>>> concern that Hugh raised last time this was
> >>>> proposed...
> >>>>
> >>>>> And according to the Intel manual, the TLB has fewer than 1k entries, which
> >>>>> covers < 4M of memory. In today's systems, several gigabytes of memory is
> >>>>> normal. After page reclaim clears a pte's access bit and before the cpu
> >>>>> accesses the page again, it's quite unlikely that this page's pte is still
> >>>>> in the TLB. And a context switch will flush the tlb too. The chance of the
> >>>>> skipped tlb flush impacting page reclaim should be very rare.
> >>>>
> >>>> Context switch to a kernel thread does not result in a
> >>>> TLB flush, due to the lazy TLB code.
> >>>>
> >>>> While I agree with you that clearing the TLB right at
> >>>> the moment the accessed bit is cleared in a PTE is
> >>>> not necessary, I believe it would be good to clear
> >>>> the TLB on affected CPUs relatively soon, maybe at the
> >>>> next time schedule is called?
> >>>>
> >>>>> --- linux.orig/arch/x86/mm/pgtable.c 2014-03-27 05:22:08.572100549 +0800
> >>>>> +++ linux/arch/x86/mm/pgtable.c 2014-03-27 05:46:12.456131121 +0800
> >>>>> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
> >>>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>>>> unsigned long address, pte_t *ptep)
> >>>>> {
> >>>>> - int young;
> >>>>> -
> >>>>> - young = ptep_test_and_clear_young(vma, address, ptep);
> >>>>> - if (young)
> >>>>> - flush_tlb_page(vma, address);
> >>>>> -
> >>>>> - return young;
> >>>>> + /*
> >>>>> + * In X86, clearing access bit without TLB flush doesn't cause data
> >>>>> + * corruption. Doing this could cause wrong page aging and so hot pages
> >>>>> + * are reclaimed, but the chance should be very rare.
> >>>>> + */
> >>>>> + return ptep_test_and_clear_young(vma, address, ptep);
> >>>>> }
> >>>>
> >>>>
> >>>> At this point, we could use vma->vm_mm->cpu_vm_mask_var to
> >>>> set (or clear) some bit in the per-cpu data of each CPU that
> >>>> has active/valid tlb state for the mm in question.
> >>>>
> >>>> I could see using cpu_tlbstate.state for this, or maybe
> >>>> another variable in cpu_tlbstate, so switch_mm will load
> >>>> both items with the same cache line.
> >>>>
> >>>> At schedule time, the function switch_mm() can examine that
> >>>> variable (it already touches that data, anyway), and flush
> >>>> the TLB even if prev==next.
> >>>>
> >>>> I suspect that would be both low overhead enough to get you
> >>>> the performance gains you want, and address the concern that
> >>>> we do want to flush the TLB at some point.
> >>>>
> >>>> Does that sound reasonable?
> >>>
> >>> So it looks like what you suggested is to force a tlb flush for an mm with the
> >>> access bit cleared in two corner cases:
> >>> 1. lazy tlb flush
> >>> 2. context switch between threads of one process
> >>>
> >>> Am I missing anything? I'm wondering if we should care about these corner cases.
> >>
> >> I believe the corner case is relatively rare, but I also
> >> suspect that your patch could fail pretty badly in some
> >> of those cases, and the fix is easy...
> >>
> >>> On the other hand, a thread might run a long time without scheduling. If the
> >>> corner cases are an issue, the long-running thread is a more severe issue. My
> >>> point is that context switch does provide a safeguard, but we don't depend on
> >>> it. The whole theory behind this patch is that a page whose access bit has
> >>> been cleared is unlikely to be accessed again while its pte entry is still in
> >>> the tlb cache.
> >>
> >> On the contrary, a TLB with a good cache policy should
> >> retain the most actively used entries, in favor of
> >> less actively used ones.
> >>
> >> That means the pages we care most about keeping, are
> >> the ones also most at danger of not having the accessed
> >> bit flushed to memory.
> >>
> >> Does the attached (untested) patch look reasonable?
> >
> > It obviously works. Testing shows there is no extra tradeoff compared to just
> > skipping the tlb flush, either. So I have no objection to this if you insist on
> > a safeguard like this. Should we also force not entering lazy tlb (in
> > context_switch) if force_flush is set? You talked about that, but I didn't see
> > it in the patch. Should I push this, or will you do it?
>
> Thank you for testing the patch.
>
> I think we should be fine not adding any code to the lazy tlb
> entering code, because a kernel thread in lazy tlb mode should
> not be accessing user space memory, except for the vhost-net
> thread, which will then wake up userspace, and cause the flush.
>
> Since performance numbers were identical, can I use your
> performance numbers when submitting my patch? :)
Sure. I tested 10 threads sequentially accessing memory, swapping to 2 PCIe
SSDs. Without the patch, the swapout speed is around 1.5GB/s; with the patch,
the speed is around 1.85GB/s. As I said before, the patch can provide about a
20% ~ 30% swapout speedup for ssd swap.
Thanks,
Shaohua
Thread overview: 15+ messages
2014-03-26 22:30 Shaohua Li
2014-03-26 23:55 ` Rik van Riel
2014-03-27 17:12 ` Shaohua Li
2014-03-27 18:41 ` Rik van Riel
2014-03-28 19:02 ` Shaohua Li
2014-03-30 12:58 ` Rik van Riel
2014-03-31 2:16 ` Shaohua Li [this message]
2014-04-02 13:01 ` Mel Gorman
2014-04-02 15:42 ` Hugh Dickins
2014-04-03 0:42 Shaohua Li
2014-04-03 11:35 ` [patch] x86: " Ingo Molnar
2014-04-03 13:45 ` Shaohua Li
2014-04-04 15:01 ` Johannes Weiner
2014-04-08 7:58 ` Shaohua Li
2014-04-14 11:36 ` Ingo Molnar