From: Dave Hansen <dave.hansen@intel.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Hugh Dickins <hughd@google.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
Christoph Lameter <cl@gentwo.org>,
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
Jerome Marchand <jmarchan@redhat.com>,
Yang Shi <yang.shi@linaro.org>,
Sasha Levin <sasha.levin@oracle.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
Date: Tue, 16 Feb 2016 09:38:48 -0800
Message-ID: <56C35EA8.2000407@intel.com>
In-Reply-To: <20160216142657.GA16364@node.shutemov.name>

Sorry, fat-fingered the send on the last one.

On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
>>> +	if (unlikely(pmd_none(*fe->pmd) &&
>>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
>>> +		return VM_FAULT_OOM;
>>
>> Should we just move this pmd_none() check in to __pte_alloc()? You do
>> this same-style check at least twice.
>
> We have it there. The check here is speculative to avoid taking ptl.
OK, that's a performance optimization. Why shouldn't all callers of
__pte_alloc() get the same optimization?
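
Something like this, maybe (completely untested, and pte_alloc() is just
a strawman name for illustration):

	static inline int pte_alloc(struct mm_struct *mm,
			struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long address)
	{
		/* Speculative check: skip the allocation and ptl
		 * entirely if somebody already put a pte page here. */
		if (likely(!pmd_none(*pmd)))
			return 0;
		return __pte_alloc(mm, vma, pmd, address);
	}

That way every caller picks up the speculative check for free instead of
open-coding it.
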
>>> +	/* If a huge pmd materialized from under us, just retry later */
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return 0;
>>
>> Nit: please stop sprinkling unlikely() everywhere. Is there some
>> concrete benefit to doing it here? I really doubt the compiler needs
>> help putting the code for "return 0" out-of-line.
>>
>> Why is it important to abort here? Is this a small-page-only path?
>
> This unlikely() was moved from __handle_mm_fault(). I didn't put much
> consideration in it.
OK, but separately from the unlikely()... Why is it important to jump
out of this code when we see a pmd_trans_huge() pmd?
>>> +static int pte_alloc_one_map(struct fault_env *fe)
>>> +{
>>> +	struct vm_area_struct *vma = fe->vma;
>>> +
>>> +	if (!pmd_none(*fe->pmd))
>>> +		goto map_pte;
>>
>> So the calling convention here is...? It looks like this has to be
>> called with pmd_none(*fe->pmd) true. If not, we assume the pmd points
>> to a pte page. This can never be called on a huge pmd. Right?
>
> It's not under ptl, so the pmd can be filled under us. There's a huge pmd
> check in the 'map_pte' goto path.
OK, could we add some comments on that? We expect to be called to
______, but if there is a race, we might also have to handle ______, etc...?
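
Going by your answer above, maybe something like:

	/*
	 * We expect fe->pmd to be none or to point to a pte page table.
	 * We are not holding ptl, so a huge pmd can be installed under
	 * us; the pmd_trans_huge() check under 'map_pte' handles that
	 * race.
	 */

(Wording is only a suggestion.)
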
>>> +	if (fe->prealloc_pte) {
>>> +		smp_wmb(); /* See comment in __pte_alloc() */
>>
>> Are we trying to make *this* cpu's write visible, or to see the write
>> from __pte_alloc()? It seems like we're trying to see the write. Isn't
>> smp_rmb() what we want for that?
>
> See 362a61ad6119.
That patch explains that anyone allocating and initializing a page table
page must ensure that all CPUs can see the initialization writes
*before* the page can be linked into the page tables. __pte_alloc()
performs a smp_wmb() to ensure that other processors can see its writes.
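
In other words, the write side is (paraphrasing __pte_alloc()):

	pgtable_t new = pte_alloc_one(mm, address);
	/* ... error handling, ptl ... */
	smp_wmb();	/* initializing writes become visible before... */
	pmd_populate(mm, pmd, new);	/* ...the page is linked in */
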
That still doesn't answer my question though. What does this barrier
do? What does it make visible to this processor? __pte_alloc() already
made its initialization visible, so what's the purpose *here*?
>>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
>>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
>>> +		spin_unlock(fe->ptl);
>>> +		fe->prealloc_pte = 0;
>>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
>>> +			fe->address))) {
>>> +		return VM_FAULT_OOM;
>>> +	}
>>> +map_pte:
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return VM_FAULT_NOPAGE;
>>
>> I think I need a refresher on the locking rules. pte_offset_map*() is
>> unsafe to call on a huge pmd. What in this context makes it impossible
>> for the pmd to get promoted after the check?
>
> Do you mean what stops a pte page table from being collapsed into a huge
> pmd? The answer is mmap_sem. The collapse code takes the lock for write
> to be able to retract a page table.
What I learned in this set is that pte_offset_map_lock() is dangerous to
call unless THPs have been excluded somehow from the PMD it's being
called on.
What I'm looking for is something to make sure that the context has been
thought through and is thoroughly THP-free.
It sounds like you've thought through all the cases, but your thoughts
aren't clear from the way the code is laid out currently.
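
Even a stock comment at each pte_offset_map*() call site would go a long
way. Something like this (wording is only a suggestion):

	/*
	 * THP is excluded here: we hold mmap_sem for read, so the
	 * collapse code, which takes mmap_sem for write, can't turn
	 * this page table into a huge pmd under us.
	 */
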
>>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
>>> *
>>> * Target users are page handler itself and implementations of
>>> * vm_ops->map_pages.
>>> */
>>> -void do_set_pte(struct fault_env *fe, struct page *page)
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +		struct page *page)
>>> {
>>> 	struct vm_area_struct *vma = fe->vma;
>>> 	bool write = fe->flags & FAULT_FLAG_WRITE;
>>> 	pte_t entry;
>>>
>>> +	if (!fe->pte) {
>>> +		int ret = pte_alloc_one_map(fe);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	if (!pte_none(*fe->pte))
>>> +		return VM_FAULT_NOPAGE;
>>
>> Oh, you've got to add another pte_none() check because you're deferring
>> the acquisition of the ptl lock?
>
> Yes, we need to re-check once ptl is taken.
Another good comment to add, I think. :)
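
Maybe along the lines of:

	/*
	 * We only checked pte_none() earlier without holding ptl, and
	 * ptl was only taken in pte_alloc_one_map(), so the pte may
	 * have been populated under us; re-check it under the lock.
	 */
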
>>> -	/* Check if it makes any sense to call ->map_pages */
>>> -	fe->address = start_addr;
>>> -	while (!pte_none(*fe->pte)) {
>>> -		if (++start_pgoff > end_pgoff)
>>> -			goto out;
>>> -		fe->address += PAGE_SIZE;
>>> -		if (fe->address >= fe->vma->vm_end)
>>> -			goto out;
>>> -		fe->pte++;
>>> +	if (pmd_none(*fe->pmd))
>>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
>>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +	if (fe->prealloc_pte) {
>>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
>>> +		fe->prealloc_pte = 0;
>>> 	}
>>> +	if (!fe->pte)
>>> +		goto out;
>>
>> What does !fe->pte *mean* here? No pte page? No pte present? Huge pte
>> present?
>
> Huge pmd is mapped.
>
> Comment added.
Huh, so in _some_ contexts, !fe->pte means that we've got a huge pmd. I
don't remember seeing that spelled out in the structure comments.
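
Maybe a note on the field itself (going by the fault_env from patch 07):

	pte_t *pte;	/* pte mapped for fe->address, or NULL; note
			 * that NULL can also mean a huge pmd is
			 * mapped at fe->pmd */
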
>>> +	if (unlikely(pmd_none(*fe->pmd))) {
>>> +		/*
>>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>> +		 * want to allocate huge page, and if we expose page table
>>> +		 * for an instant, it will be difficult to retract from
>>> +		 * concurrent faults and from rmap lookups.
>>> +		 */
>>> +	} else {
>>> +		/*
>>> +		 * A regular pmd is established and it can't morph into a huge
>>> +		 * pmd from under us anymore at this point because we hold the
>>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
>>> +		 * So now it's safe to run pte_offset_map().
>>> +		 */
>>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
>>> +
>>> +		entry = *fe->pte;
>>> +		barrier();
>>
>> Barrier because....?
Did you miss a response here, Kirill?
>>> +		if (pte_none(entry)) {
>>> +			pte_unmap(fe->pte);
>>> +			fe->pte = NULL;
>>> +		}
>>> +	}
>>> +
>>> 	/*
>>> 	 * some architectures can have larger ptes than wordsize,
>>> 	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
>>> 	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
>>> -	 * The code below just needs a consistent view for the ifs and
>>> +	 * The code above just needs a consistent view for the ifs and
>>> 	 * we later double check anyway with the ptl lock held. So here
>>> 	 * a barrier will do.
>>> 	 */
>>
>> Looks like the barrier got moved, but not the comment.
>
> Moved.
>
>> Man, that's a lot of code.
>
> Yeah. I don't see a sensible way to split it. :-/
Can you do the "postpone allocation" parts without adding additional THP
code? Or does the postponement itself force all of the extra
THP-handling spots?