From: Laurent Dufour <ldufour@linux.vnet.ibm.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org,
akpm@linux-foundation.org, ak@linux.intel.com, mhocko@kernel.org,
dave@stgolabs.net, jack@suse.cz,
Matthew Wilcox <willy@infradead.org>,
benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>,
hpa@zytor.com, Will Deacon <will.deacon@arm.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
npiggin@gmail.com, bsingharora@gmail.com,
Tim Chen <tim.c.chen@linux.intel.com>,
linuxppc-dev@lists.ozlabs.org, x86@kernel.org
Subject: Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure
Date: Wed, 30 Aug 2017 10:56:45 +0200
Message-ID: <8bd2865a-f390-9114-9852-7e32d4113016@linux.vnet.ibm.com>
In-Reply-To: <20170827001823.n5wgkfq36z6snvf2@node.shutemov.name>

On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * holding the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> + unsigned int flags)
>> +{
>> + struct vm_fault vmf = {
>> + .address = address,
>> + };
>> + pgd_t *pgd;
>> + p4d_t *p4d;
>> + pud_t *pud;
>> + pmd_t *pmd;
>> + int dead, seq, idx, ret = VM_FAULT_RETRY;
>> + struct vm_area_struct *vma;
>> + struct mempolicy *pol;
>> +
>> + /* Clear flags that may lead to releasing the mmap_sem to retry */
>> + flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> + flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> + idx = srcu_read_lock(&vma_srcu);
>> + vma = find_vma_srcu(mm, address);
>> + if (!vma)
>> + goto unlock;
>> +
>> + /*
>> + * Validate the VMA found by the lockless lookup.
>> + */
>> + dead = RB_EMPTY_NODE(&vma->vm_rb);
>> + seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> + if ((seq & 1) || dead)
>> + goto unlock;
>> +
>> + /*
>> + * Can't call vm_ops services as we don't know what they would do
>> + * with the VMA.
>> + * This includes huge pages from hugetlbfs.
>> + */
>> + if (vma->vm_ops)
>> + goto unlock;
>
> I think we need to have a way to white-list safe ->vm_ops.
>
>> +
>> + if (unlikely(!vma->anon_vma))
>> + goto unlock;
>
> It deserves a comment.
>
>> +
>> + vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> + vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> + /* Can't call userland page fault handler in the speculative path */
>> + if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> + goto unlock;
>> +
>> + /*
>> + * MPOL_INTERLEAVE implies additional checks in mpol_misplaced() which
>> + * are not compatible with the speculative page fault processing.
>> + */
>> + pol = __get_vma_policy(vma, address);
>> + if (!pol)
>> + pol = get_task_policy(current);
>> + if (pol && pol->mode == MPOL_INTERLEAVE)
>> + goto unlock;
>> +
>> + if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> + /*
>> + * This could be detected by the check address against VMA's
>> + * boundaries but we want to trace it as not supported instead
>> + * of changed.
>> + */
>> + goto unlock;
>> +
>> + if (address < READ_ONCE(vma->vm_start)
>> + || READ_ONCE(vma->vm_end) <= address)
>> + goto unlock;
>> +
>> + /*
>> + * The three following checks are copied from access_error from
>> + * arch/x86/mm/fault.c
>> + */
>> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> + flags & FAULT_FLAG_INSTRUCTION,
>> + flags & FAULT_FLAG_REMOTE))
>> + goto unlock;
>> +
>> + /* This one is required to check that the VMA has write access set */
>> + if (flags & FAULT_FLAG_WRITE) {
>> + if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> + goto unlock;
>> + } else {
>> + if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE))))
>> + goto unlock;
>> + }
>> +
>> + /*
>> + * Do a speculative lookup of the PTE entry.
>> + */
>> + local_irq_disable();
>> + pgd = pgd_offset(mm, address);
>> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> + goto out_walk;
>> +
>> + p4d = p4d_offset(pgd, address);
>> + if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> + goto out_walk;
>> +
>> + pud = pud_offset(p4d, address);
>> + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> + goto out_walk;
>> +
>> + pmd = pmd_offset(pud, address);
>> + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> + goto out_walk;
>> +
>> + /*
>> + * The above does not allocate/instantiate page-tables because doing so
>> + * would lead to the possibility of instantiating page-tables after
>> + * free_pgtables() -- and consequently leaking them.
>> + *
>> + * The result is that we take at least one !speculative fault per PMD
>> + * in order to instantiate it.
>> + */
>
>
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
>
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?
>
>> + /* Transparent huge pages are not supported. */
>> + if (unlikely(pmd_trans_huge(*pmd)))
>> + goto out_walk;
>
> That's looks like a blocker to me.
>
> Is there any problem with making it supported (besides plain coding)?

This is not straightforward, as THP is mainly handled in
__handle_mm_fault(), which is not called on the speculative path.
Handling THP in the speculative path sounds doable, but I'd have to
double-check all the callees more deeply, and this would require either
redesigning __handle_mm_fault() or doing the job in a dedicated way in
handle_speculative_fault().

Furthermore, we should handle huge pages at both the PUD and PMD levels.

That being said, I can't see any blocking issue at this time other than
plain coding, but I'd prefer to do it as a next step, as an optimization,
since huge page faults are far less frequent by design.

Handling _standard_ page faults speculatively already provides a good
performance improvement. Shouldn't we consider getting that upstreamed
first, and then adding support for THP as well as other compatible
vm_ops like hugetlb?

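
As a rough illustration of why an unsupported case such as THP costs a
missed optimization rather than correctness, here is a minimal userspace
sketch. It assumes the caller retries through the classic path whenever
the speculative handler gives up with a retry code. All sketch_* names
and SKETCH_* constants are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's fault return codes. */
#define SKETCH_FAULT_OK    0
#define SKETCH_FAULT_RETRY 1

/* Hypothetical, heavily reduced view of what the handler inspects. */
struct sketch_fault {
	bool huge_pmd;   /* the PMD maps a transparent huge page */
	bool has_vm_ops; /* the VMA has vm_ops set */
};

/* Speculative attempt: gives up on anything it does not support. */
static int sketch_speculative_fault(const struct sketch_fault *f)
{
	assert(f != NULL);
	if (f->has_vm_ops || f->huge_pmd)
		return SKETCH_FAULT_RETRY;
	return SKETCH_FAULT_OK;
}

/* Stand-in for the classic path, which would run under mmap_sem. */
static int sketch_classic_fault(const struct sketch_fault *f)
{
	assert(f != NULL);
	return SKETCH_FAULT_OK;
}

/*
 * Try the speculative path first and fall back on retry.
 * Returns 1 if the fault was handled speculatively, 0 if it fell
 * back to the classic path; *ret receives the final fault code.
 */
static int sketch_handle_fault(const struct sketch_fault *f, int *ret)
{
	assert(ret != NULL);
	*ret = sketch_speculative_fault(f);
	if (*ret != SKETCH_FAULT_RETRY)
		return 1;
	*ret = sketch_classic_fault(f);
	return 0;
}
```

With this shape, a fault on a huge PMD still completes; it just takes the
slower path, which is why THP support can come later as an optimization.
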
Cheers,
Laurent.
>> +
>> + vmf.vma = vma;
>> + vmf.pmd = pmd;
>> + vmf.pgoff = linear_page_index(vma, address);
>> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
>> + vmf.sequence = seq;
>> + vmf.flags = flags;
>> +
>> + local_irq_enable();
>> +
>> + /*
>> + * We need to re-validate the VMA after checking the bounds, otherwise
>> + * we might have a false positive on the bounds.
>> + */
>> + if (read_seqcount_retry(&vma->vm_sequence, seq))
>> + goto unlock;
>> +
>> + ret = handle_pte_fault(&vmf);
>> +
>> +unlock:
>> + srcu_read_unlock(&vma_srcu, idx);
>> + return ret;
>> +
>> +out_walk:
>> + local_irq_enable();
>> + goto unlock;
>> +}
>> +#endif /* __HAVE_ARCH_CALL_SPF */
>> +
>> /*
>> * By the time we get here, we already hold the mm semaphore
>> *
>> --
>> 2.7.4
>>
>
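
The VMA validation in the quoted patch (sample the sequence count, bail
out if it is odd, check the bounds, then re-check the count before
calling handle_pte_fault()) follows the usual seqcount reader pattern.
Below is a minimal userspace sketch of that pattern using C11 atomics.
The sketch_* names and the struct layout are hypothetical, and the
default sequentially consistent atomics stand in for the kernel's
seqcount barriers; this is not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a VMA guarded by a sequence count. */
struct sketch_vma {
	atomic_uint sequence;        /* odd while a writer is mid-update */
	_Atomic unsigned long start; /* first address covered */
	_Atomic unsigned long end;   /* first address past the range */
};

/* Writer side: make the count odd before modifying the fields... */
static void sketch_write_begin(struct sketch_vma *v)
{
	atomic_fetch_add(&v->sequence, 1);
}

/* ...and even again once the update is complete. */
static void sketch_write_end(struct sketch_vma *v)
{
	atomic_fetch_add(&v->sequence, 1);
}

/*
 * Reader side: returns true only if the address was within bounds and
 * no writer started or finished an update while we were looking.
 */
static bool sketch_speculative_lookup(struct sketch_vma *v,
				      unsigned long address)
{
	unsigned int seq = atomic_load(&v->sequence);

	if (seq & 1)
		return false; /* writer in progress, give up */

	if (address < atomic_load(&v->start) ||
	    atomic_load(&v->end) <= address)
		return false; /* out of bounds */

	/* Re-validate: the bounds are only trusted if seq is unchanged. */
	return atomic_load(&v->sequence) == seq;
}
```

A false return here corresponds to the "goto unlock" cases in the patch:
the caller simply gives up on the speculative attempt.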