Date: Mon, 31 Mar 2014 10:16:03 +0800
From: Shaohua Li
Subject: Re: [patch]x86: clearing access bit don't flush tlb
Message-ID: <20140331021603.GA24892@kernel.org>
References: <20140326223034.GA31713@kernel.org> <53336907.1050105@redhat.com> <20140327171237.GA9490@kernel.org> <533470F7.4000406@redhat.com> <20140328190233.GA14905@kernel.org> <53381506.1090104@redhat.com>
In-Reply-To: <53381506.1090104@redhat.com>
To: Rik van Riel
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, hughd@google.com, mel@csn.ul.ie

On Sun, Mar 30, 2014 at 08:58:46AM -0400, Rik van Riel wrote:
> On 03/28/2014 03:02 PM, Shaohua Li wrote:
> > On Thu, Mar 27, 2014 at 02:41:59PM -0400, Rik van Riel wrote:
> >> On 03/27/2014 01:12 PM, Shaohua Li wrote:
> >>> On Wed, Mar 26, 2014 at 07:55:51PM -0400, Rik van Riel wrote:
> >>>> On 03/26/2014 06:30 PM, Shaohua Li wrote:
> >>>>>
> >>>>> I posted this patch a year ago or so, but it got lost. I'm reposting
> >>>>> it here to see if we can make progress this time.
> >>>>
> >>>> I believe we can make progress. However, I also
> >>>> believe the code could be enhanced to address a
> >>>> concern that Hugh raised the last time this was
> >>>> proposed...
> >>>>
> >>>>> According to the Intel manuals, a TLB has fewer than 1K entries, which
> >>>>> cover less than 4MB of memory. In today's systems, several gigabytes
> >>>>> of memory is normal. Between page reclaim clearing a PTE's access bit
> >>>>> and the CPU accessing that page again, it is quite unlikely that the
> >>>>> page's PTE is still in the TLB. A context switch flushes the TLB too.
> >>>>> The chance that skipping the TLB flush impacts page reclaim should be
> >>>>> very small.
> >>>>
> >>>> Context switch to a kernel thread does not result in a
> >>>> TLB flush, due to the lazy TLB code.
> >>>>
> >>>> While I agree with you that clearing the TLB right at
> >>>> the moment the accessed bit is cleared in a PTE is
> >>>> not necessary, I believe it would be good to clear
> >>>> the TLB on affected CPUs relatively soon, maybe at the
> >>>> next time schedule is called?
> >>>>
> >>>>> --- linux.orig/arch/x86/mm/pgtable.c	2014-03-27 05:22:08.572100549 +0800
> >>>>> +++ linux/arch/x86/mm/pgtable.c	2014-03-27 05:46:12.456131121 +0800
> >>>>> @@ -399,13 +399,12 @@ int pmdp_test_and_clear_young(struct vm_
> >>>>>  int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>>>>  			   unsigned long address, pte_t *ptep)
> >>>>>  {
> >>>>> -	int young;
> >>>>> -
> >>>>> -	young = ptep_test_and_clear_young(vma, address, ptep);
> >>>>> -	if (young)
> >>>>> -		flush_tlb_page(vma, address);
> >>>>> -
> >>>>> -	return young;
> >>>>> +	/*
> >>>>> +	 * On x86, clearing the access bit without a TLB flush doesn't
> >>>>> +	 * cause data corruption. It can cause incorrect page aging, so
> >>>>> +	 * hot pages may be reclaimed, but the chance should be very rare.
> >>>>> +	 */
> >>>>> +	return ptep_test_and_clear_young(vma, address, ptep);
> >>>>>  }
> >>>>
> >>>> At this point, we could use vma->vm_mm->cpu_vm_mask_var to
> >>>> set (or clear) some bit in the per-cpu data of each CPU that
> >>>> has active/valid tlb state for the mm in question.
> >>>>
> >>>> I could see using cpu_tlbstate.state for this, or maybe
> >>>> another variable in cpu_tlbstate, so switch_mm will load
> >>>> both items with the same cache line.
> >>>>
> >>>> At schedule time, the function switch_mm() can examine that
> >>>> variable (it already touches that data, anyway), and flush
> >>>> the TLB even if prev == next.
> >>>>
> >>>> I suspect that would be both low-overhead enough to get you
> >>>> the performance gains you want, and address the concern that
> >>>> we do want to flush the TLB at some point.
> >>>>
> >>>> Does that sound reasonable?
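For reference, a minimal, untested sketch of the deferred-flush idea
described above; it would replace the ptep_clear_flush_young() from the
patch. The force_flush name matches the flag mentioned later in this
thread, but the helper structure, its layout, and the switch_mm() hook
are assumptions here, not the patch Rik attached:

#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <asm/tlbflush.h>

/* Mirrors the shape of x86's cpu_tlbstate; force_flush is the new field. */
struct tlb_state_sketch {
	struct mm_struct *active_mm;
	int state;
	bool force_flush;	/* a flush was skipped; flush at next switch_mm */
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_sketch, tlb_sketch);

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	int young = ptep_test_and_clear_young(vma, address, ptep);
	int cpu;

	if (young) {
		/*
		 * Instead of flushing now, mark every CPU that may still
		 * cache translations for this mm.
		 */
		for_each_cpu(cpu, mm_cpumask(vma->vm_mm))
			per_cpu(tlb_sketch.force_flush, cpu) = true;
	}
	return young;
}

/*
 * Called from switch_mm(), even when prev == next. This sketch ignores
 * the ordering races a real patch would have to handle.
 */
static inline void check_force_flush(void)
{
	if (this_cpu_read(tlb_sketch.force_flush)) {
		this_cpu_write(tlb_sketch.force_flush, false);
		local_flush_tlb();	/* flush this CPU's non-global entries */
	}
}

Since switch_mm() already touches cpu_tlbstate, the extra check should be
a load from the same cache line in the common case.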
> >>> So it looks like what you suggest is to force a TLB flush for an mm
> >>> with the access bit cleared in two corner cases:
> >>> 1. lazy tlb flush
> >>> 2. context switch between threads of one process
> >>>
> >>> Am I missing anything? I'm wondering if we should care about these
> >>> corner cases.
> >>
> >> I believe the corner case is relatively rare, but I also
> >> suspect that your patch could fail pretty badly in some
> >> of those cases, and the fix is easy...
> >>
> >>> On the other hand, a thread might run for a long time without
> >>> scheduling. If the corner cases are an issue, a long-running thread is
> >>> a more severe issue. My point is that a context switch does provide a
> >>> safeguard, but we don't depend on it. The whole theory behind this
> >>> patch is that a page whose access bit has been cleared is unlikely to
> >>> be accessed again while its PTE entry is still in the TLB.
> >>
> >> On the contrary, a TLB with a good cache policy should
> >> retain the most actively used entries, in favor of
> >> less actively used ones.
> >>
> >> That means the pages we care most about keeping are
> >> the ones also most in danger of not having the accessed
> >> bit flushed to memory.
> >>
> >> Does the attached (untested) patch look reasonable?
> >
> > It works, obviously. Tests show there is no extra tradeoff either,
> > compared to just skipping the TLB flush, so I have no objection to this
> > if you insist on a safeguard like this. Should we also force not
> > entering lazy TLB mode (in context_switch) when force_flush is set? You
> > were talking about that, but I didn't see it in the patch. Should I push
> > this, or will you do it?
>
> Thank you for testing the patch.
>
> I think we should be fine not adding any code to the code that
> enters lazy TLB mode, because a kernel thread in lazy TLB mode
> should not be accessing user space memory, except for the
> vhost-net thread, which will then wake up userspace and cause
> the flush.
>
> Since performance numbers were identical, can I use your
> performance numbers when submitting my patch? :)

Sure. I tested 10 threads sequentially accessing memory, swapping out to
two PCIe SSDs. Without the patch, the swapout speed is around 1.5GB/s; with
the patch, the speed is around 1.85GB/s. As I said before, the patch can
provide about a 20% ~ 30% swapout speedup for SSD swap.

Thanks,
Shaohua