From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH v2 4/4] mm: fix KSM data corruption
From: Nadav Amit
In-Reply-To: <1501566977-20293-5-git-send-email-minchan@kernel.org>
Date: Tue, 1 Aug 2017 12:21:41 -0700
References: <1501566977-20293-1-git-send-email-minchan@kernel.org>
 <1501566977-20293-5-git-send-email-minchan@kernel.org>
Sender: owner-linux-mm@kvack.org
To: Minchan Kim
Cc: Andrew Morton, LKML, "open list:MEMORY MANAGEMENT", kernel-team,
 Mel Gorman, Hugh Dickins, Andrea Arcangeli

Minchan Kim wrote:

> Nadav reported KSM can corrupt the user data by the TLB batching race[1].
> That means data written by the user can be lost.
>
> Quote from Nadav Amit
> "
> For this race we need 4 CPUs:
>
> CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> write later.
>
> CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> the dirty-bit. It batches TLB flushes.
>
> CPU2: Writes 4 to /proc/PID/clear_refs, clearing the PTE's soft-dirty bit. We
> care about the fact that it clears the PTE write-bit, and of course, batches
> TLB flushes.
>
> CPU3: Runs KSM.
> Our purpose is to pass the following test in
> write_protect_page():
>
> 	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> 	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
>
> Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> Later, and before replacing the page, we would be able to change the page.
>
> Note that all the operations that CPU1-3 perform can happen in parallel since
> they only acquire mmap_sem for read.
>
> We start with two identical pages. Everything below regards the same
> page/PTE.
>
> CPU0            CPU1            CPU2            CPU3
> ----            ----            ----            ----
> Write the same
> value on page
>
> [cache PTE as
>  dirty in TLB]
>
>                 MADV_FREE
>                 pte_mkclean()
>
>                                 4 > clear_refs
>                                 pte_wrprotect()
>
>                                                 write_protect_page()
>                                                 [ success, no flush ]
>
>                                                 pages_identical()
>                                                 [ ok ]
>
> Write to page
> different value
>
> [Ok, using stale
>  PTE]
>
>                                                 replace_page()
>
> Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> already wrote on the page, but KSM ignored this write, and it got lost.
> "
>
> In the above scenario, MADV_FREE is fixed by changing the TLB batching API,
> including [set|clear]_tlb_flush_pending. What remains is the soft-dirty part.
>
> This patch changes soft-dirty to use the TLB batching API instead of
> flush_tlb_mm, and makes KSM check for a pending TLB flush using
> mm_tlb_flush_pending, so that it flushes the TLB to avoid data loss if other
> parallel threads have a TLB flush pending.
>
> [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
>
> Note:
> I failed to reproduce this problem with Nadav's test program, which needs its
> timing tuned to my system's speed, so I could not confirm that the fix works.
> Nadav, could you test this patch on your test machine?
>
> Thanks!
>
> Cc: Nadav Amit
> Cc: Mel Gorman
> Cc: Hugh Dickins
> Cc: Andrea Arcangeli
> Signed-off-by: Minchan Kim
> ---
>  fs/proc/task_mmu.c | 4 +++-
>  mm/ksm.c           | 3 ++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 9782dedeead7..58ef3a6abbc0 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1018,6 +1018,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  	enum clear_refs_types type;
>  	int itype;
>  	int rv;
> +	struct mmu_gather tlb;
>
>  	memset(buffer, 0, sizeof(buffer));
>  	if (count > sizeof(buffer) - 1)
> @@ -1062,6 +1063,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		}
>
>  		down_read(&mm->mmap_sem);
> +		tlb_gather_mmu(&tlb, mm, 0, -1);
>  		if (type == CLEAR_REFS_SOFT_DIRTY) {
>  			for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  				if (!(vma->vm_flags & VM_SOFTDIRTY))
> @@ -1083,7 +1085,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  		walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
>  		if (type == CLEAR_REFS_SOFT_DIRTY)
>  			mmu_notifier_invalidate_range_end(mm, 0, -1);
> -		flush_tlb_mm(mm);
> +		tlb_finish_mmu(&tlb, 0, -1);
>  		up_read(&mm->mmap_sem);
>  out_mm:
>  		mmput(mm);
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 0c927e36a639..15dd7415f7b3 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1038,7 +1038,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
>  		goto out_unlock;
>
>  	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> -	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
> +	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
> +						mm_tlb_flush_pending(mm)) {
>  		pte_t entry;
>
>  		swapped = PageSwapCache(page);
> --
> 2.7.4

I tested the patch-set, and my PoC does not fail anymore.

Thanks,
Nadav