Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
From: Nadav Amit
Date: Thu, 31 Aug 2017 12:15:58 -0700
Message-Id: <3796F736-0430-49BD-8B7A-3576BE902D51@gmail.com>
In-Reply-To: <20170831171241.GE31400@redhat.com>
References: <20170829235447.10050-1-jglisse@redhat.com>
 <20170829235447.10050-3-jglisse@redhat.com>
 <6D58FBE4-5D03-49CC-AAFF-3C1279A5A849@gmail.com>
 <20170830172747.GE13559@redhat.com>
 <003685D9-4DA9-42DC-AF46-7A9F8A43E61F@gmail.com>
 <20170830212514.GI13559@redhat.com>
 <75825BFF-8ACC-4CAB-93EB-AD9673747518@gmail.com>
 <20170831004719.GF9445@redhat.com>
 <20170831171241.GE31400@redhat.com>
To: Andrea Arcangeli
Cc: Jerome Glisse, Linux Kernel Mailing List, "open list:MEMORY MANAGEMENT",
 Dan Williams, Ross Zwisler, Linus Torvalds, Bernhard Held, Adam Borowski,
 Radim Krčmář, Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
 "Kirill A. Shutemov", axie, Andrew Morton, Joerg Roedel, iommu

Andrea Arcangeli wrote:

> On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
>> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
>>> For both CoW and KSM, correctness is maintained by calling
>>> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
>>> (i.e., replace ptep_clear_flush_notify() with ptep_clear_flush()), you
>>> will cause memory corruption, and the page lock would not be enough.
>>
>> Just to add to that: the IOMMU has its own CPU page table walker and it
>> can walk the page table at any time (not necessarily the page table of
>> the task currently running on the CPU; the IOMMU has an array that maps
>> a PASID to a page table, and the device requests translation of a given
>> virtual address against a PASID).
>>
>> So the following can happen with ptep_clear_flush() only:
>>
>>   CPU                                  | IOMMU
>>                                        | - walk page table, populate tlb at addr A
>>   - clear pte at addr A                |
>>   - set new pte                        |
>>   - mmu_notifier_invalidate_range_end  | - flush IOMMU/device tlb
>>
>> The device is using the old page while the CPU uses the new page :(
>
> That condition won't be persistent.
>
>> But with ptep_clear_flush_notify():
>>
>>   CPU                                  | IOMMU
>>                                        | - walk page table, populate tlb at addr A
>>   - clear pte at addr A                |
>>   - notify -> invalidate_range         | - flush IOMMU/device tlb
>>   - set new pte                        |
>>
>> So now the IOMMU either sees the empty pte and triggers a page fault
>> (this is the case if there is a racing IOMMU ATS request right after the
>> IOMMU/device tlb flush but before the new pte is set), or it sees the
>> new pte. Either way, both the IOMMU and the CPU have a coherent view of
>> what a virtual address points to.
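
To make the safe ordering concrete, here is a minimal sketch, loosely
modeled on the 4.13-era CoW path (wp_page_copy()). Locking, rmap
accounting and error handling are omitted, and install_new_page() is just
a placeholder name for illustration:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Sketch only: assumes the caller holds the PT lock and runs between
 * mmu_notifier_invalidate_range_start()/_end().
 */
static void install_new_page(struct vm_area_struct *vma, unsigned long addr,
			     pte_t *ptep, struct page *new_page)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t entry = mk_pte(new_page, vma->vm_page_prot);

	/*
	 * Clear the old pte, flush the primary TLB and, through the
	 * ->invalidate_range() callback, the IOMMU/device TLB. After this
	 * point a racing hardware walker sees either a cleared pte (and
	 * faults) or, once the next line runs, the new pte.
	 */
	ptep_clear_flush_notify(vma, addr, ptep);

	/* Only now does the new translation become visible to any walker. */
	set_pte_at_notify(mm, addr, ptep, entry);
}

With plain ptep_clear_flush() instead of the _notify variant, the device
TLB would only be flushed from mmu_notifier_invalidate_range_end(), i.e.
after the new pte is already live, which is exactly the bad interleaving
in the first diagram.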
>
> Sure, the _notify version is obviously safe.
>
>> Andrea explained to me the historical reason that set_pte_at_notify()
>> calls the change_pte callback: it was intended so that KVM could update
>> the secondary page table directly without having to fault. It is now a
>> pointless optimization, since the call to range_start(), which now
>> happens everywhere before any set_pte_at_notify(), invalidates the
>> secondary page table and thus leads to a page fault for the VM anyway.
>> I have talked with Andrea about ways to bring back this optimization.
>
> Yes, we have known that for a long time; the optimization got basically
> disabled when range_start/end expanded. It'd be nice to optimize
> change_pte again, but that is for later.
>
>> Yes, we need the following sequence for the IOMMU:
>>   - clear pte
>>   - invalidate IOMMU/device TLB
>>   - set new pte
>>
>> Otherwise the IOMMU page table walker can populate the IOMMU/device tlb
>> with a stale entry.
>>
>> Note that this is not necessary in every case. For try_to_unmap it is
>> fine, for instance, to move the IOMMU tlb shootdown to after changing
>> the CPU page table, as we are not pointing the pte to a different page.
>> Either we clear the pte or we set a swap entry, and as long as the page
>> that used to be pointed to by the pte is not freed before the IOMMU tlb
>> flush, we are fine.
>>
>> In fact I think the only case where we need the above sequence (clear,
>> flush secondary tlb, set new pte) is CoW. In all other cases I think we
>> can get rid of invalidate_range() from inside the page table lock and
>> rely on invalidate_range_end(), which is called unconditionally.
>
> Even with CoW, it's not a big issue if the IOMMU keeps reading from the
> old page for a little while longer (in between the PT lock release and
> mmu_notifier_invalidate_range_end).
>
> How can you tell you read the old data a bit longer because the device
> noticed the new page only when mmu_notifier_invalidate_range_end ran,
> and not because the CPU was faster at writing than the IOMMU was at
> loading the new pagetable?
>
> I figure it would only be detectable in that the CPU could see pageA
> changing before pageB, while the iommu-v2 could see pageB changing
> before pageA. If that's a concern, it is the only good reason I can
> think of right now for requiring ->invalidate_page inside the CoW PT
> lock to enforce the same ordering. However, if the modifications happen
> simultaneously and it's a correct runtime, the order must not matter,
> but it would still be detectable, which may not be desirable.

I don't know what memory model SVM provides, but what you describe here
potentially violates it. I don't think user software should be expected
to deal with it.
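
For contrast, the try_to_unmap case Jerome mentions above, where deferring
the secondary flush to invalidate_range_end() is fine, might look roughly
like the sketch below. This is only an illustration, not the real
try_to_unmap_one(); unmap_one_page() is a placeholder name, and it assumes
the caller holds the PT lock and brackets the operation with
invalidate_range_start()/_end():

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/*
 * Sketch of the "unmap" shape of the problem: the pte goes from a page to
 * a swap entry, never to a *different* page, so flushing the device TLB
 * only at mmu_notifier_invalidate_range_end() is fine as long as the page
 * is not freed before that flush.
 */
static void unmap_one_page(struct vm_area_struct *vma, unsigned long addr,
			   pte_t *ptep, struct page *page, swp_entry_t entry)
{
	struct mm_struct *mm = vma->vm_mm;

	/* Flushes the primary TLB only; the device TLB may keep a stale entry. */
	ptep_clear_flush(vma, addr, ptep);

	/* Worst case, the device keeps using the old page a little longer. */
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

	page_remove_rmap(page, false);
	/*
	 * The reference dropped here must not be the last one: the caller
	 * has to keep the page alive until invalidate_range_end() has
	 * flushed the IOMMU/device TLB.
	 */
	put_page(page);
}

The key property is that the pte never points to a different page; the old
page just has to stay allocated until invalidate_range_end() has flushed
the secondary TLBs.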
>
> Currently the _notify is absolutely needed to run ->invalidate_range
> before put_page runs in the CoW code below. Because that put_page is
> done in a place that doesn't scale, it would be better moved down
> regardless of ->invalidate_range, to reduce the size of the
> PT-lock-protected critical section:
>
> -	if (new_page)
> -		put_page(new_page);
>
> 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	if (new_page)
> +		put_page(new_page);
>
> Of course the iommu will not immediately start reading from the new
> page, but even if it triggers a write-protect fault and calls
> handle_mm_fault, it will find a writable pte already there, and it will
> get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
> runs, so it is sure to notice the new page.
>
> Now write_protect_page in KSM...
>
> What bad can happen if the iommu keeps writing to the write-protected
> page for a little while longer? As long as nothing writes to the page
> anymore by the time write_protect_page() returns, pages_identical will
> work. How do you know the IOMMU was just a bit faster and wrote a few
> more bytes, and it wasn't mmu_notifier_invalidate_range_end that ran a
> bit later after dropping the PT lock?

I guess it should be ok in this case.

>
> Now replace_page in KSM...
>
> 	ptep_clear_flush_notify(vma, addr, ptep);
> 	set_pte_at_notify(mm, addr, ptep, newpte);
>
> 	page_remove_rmap(page, false);
> 	if (!page_mapped(page))
> 		try_to_free_swap(page);
> -	put_page(page);
>
> 	pte_unmap_unlock(ptep, ptl);
> 	err = 0;
> out_mn:
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out:
> +	put_page(page);	/* TODO: broken of course, fix error cases */
> 	return err;
> }
>
> If we free the old page after mmu_notifier_invalidate_range_end (fixing
> up the error case; the above change ignores the error paths), the
> contents of the old and new page are identical for replace_page.
>
> Even if the new page immediately takes a CoW, how do you know the IOMMU
> was just a bit slower and read the old page content pre-CoW? They're
> guaranteed identical and both read-only already at that point.

A potential problem might be that this requires that there be no promotion
of PTE access rights (e.g., RO->RW), since in those cases there would be
no invalidation. I did not find such a code path, but I might have missed
one, or a new one could be introduced in the future. For example, someone
may optimize KSM not to CoW the last reference of a KSM'd page upon page
fault.

Therefore, such optimizations may require some sort of checker, or at the
very least clear documentation of when an invalidation is required,
especially since it would then require invalidation on access-rights
promotion, which is currently not needed.

> All the above considered, if this is getting way too complex, it may
> be preferable to keep things obviously safe: always run
> mmu_notifier_invalidate_range inside the PT lock and be done with it,
> and remove the ->invalidate_range from
> mmu_notifier_invalidate_range_end too, to avoid the double invalidates
> for secondary MMUs with hardware pagetable walkers that share the
> pagetables with the primary MMU.
>
> In principle, the primary reason for doing _notify or an explicit
> mmu_notifier_invalidate_range() is to keep things simpler and to avoid
> having to care where exactly pages get freed (i.e. before or after
> mmu_notifier_invalidate_range_end).
>
> For example, zap_page_range tlb-gather freeing strictly requires an
> explicit mmu_notifier_invalidate_range before the page is actually
> freed (because the tlb gather will free the pages well before
> mmu_notifier_invalidate_range_end can run).
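
In sketch form, the constraint in the tlb-gather case is simply that the
secondary flush has to sit between the primary TLB flush and the actual
freeing. The snippet below is a simplification of what the mmu_gather code
does; flush_and_free_batch() is a placeholder name, not a real kernel
function:

#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <asm/tlbflush.h>

/*
 * Simplified sketch of the tlb-gather ordering requirement: the pages in
 * @pages were already unmapped under the PT lock without a _notify flush,
 * so the IOMMU/device TLB may still hold entries for them. They must be
 * invalidated before the pages can be freed; waiting for
 * mmu_notifier_invalidate_range_end() would be too late, because the
 * gather frees the pages first.
 */
static void flush_and_free_batch(struct mm_struct *mm,
				 struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 struct page **pages, int nr)
{
	int i;

	/* Primary TLB first... */
	flush_tlb_range(vma, start, end);

	/* ...then the secondary/device TLBs, before any page is freed. */
	mmu_notifier_invalidate_range(mm, start, end);

	for (i = 0; i < nr; i++)
		put_page(pages[i]);
}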
>
> The concern that an ->invalidate_range is always needed before the PT
> lock is released, if the primary TLB was flushed inside the PT lock, is
> a more recent one, and it looks to me like it's not always needed but
> perhaps only in some cases.
>
> An example where the ->invalidate_range inside
> mmu_notifier_invalidate_range_end pays off is madvise_free_pte_range.
> That doesn't flush the TLB before marking the pagetable clean, so the
> primary MMU can still write through the dirty primary TLB without
> setting the dirty/accessed bit after madvise_free_pte_range returns.
>
> 			ptent = ptep_get_and_clear_full(mm, addr, pte,
> 							tlb->fullmm);
>
> 			ptent = pte_mkold(ptent);
> 			ptent = pte_mkclean(ptent);
> 			set_pte_at(mm, addr, pte, ptent);
> 			tlb_remove_tlb_entry(tlb, pte, addr);
>
> Not even the primary TLB is flushed here. All concurrent writes of the
> primary MMU can still get lost while MADV_FREE runs. All that is
> guaranteed is that after the MADV_FREE madvise syscall returns to
> userland, new writes will stick, so it is also enough to call
> ->invalidate_range inside the single
> mmu_notifier_invalidate_range_end:
>
> 	madvise_free_page_range(&tlb, vma, start, end);
> 	mmu_notifier_invalidate_range_end(mm, start, end);
> 	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This is why we have both _notify and mmu_notifier_invalidate_range_end.
> If we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
> we'll have to add a mmu_notifier_invalidate_range in places like the
> above (just before or just after mmu_notifier_invalidate_range_end).
>
> So with ptep_clear_flush_notify, which avoids any issue with page
> freeing in places like CoW, and the explicit
> mmu_notifier_invalidate_range in the tlb gather, the rest gets covered
> automatically by mmu_notifier_invalidate_range_end. And again, this
> only started to be needed when we added support for hardware pagetable
> walkers that cannot stop the pagetable walking (unless they break the
> sharing of the pagetable with the primary MMU, which of course is not
> desirable and would cause unnecessary overhead).
>
> The current ->invalidate_range handling does result in double calls
> here and there when armed, but it reduces the number of explicit hooks
> required in the common code, and it keeps the mmu_notifier code less
> intrusive and closer to optimal when disarmed (though less optimal when
> armed). So the current state is a reasonable tradeoff, but there's room
> for optimization.

Everything you say makes sense, but at the end of the day you end up with
two different schemes for flushing the primary and secondary TLBs. For
example, the primary TLB flush mechanism is much more selective in what it
flushes than the secondary one. The discrepancy between the two surely
does not make things simpler.

Indeed, there might be different costs and trade-offs for the primary and
secondary TLBs (e.g., TLB flush/shootdown time, TLB miss latency, whether
an address space is always loaded), but I have some doubts as to whether
this decision was data-driven.

Regards,
Nadav