From: Dan Williams <dan.j.williams@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Dave Hansen <dave@sr71.net>,
linux-nvdimm <linux-nvdimm@ml01.01.org>,
Peter Zijlstra <peterz@infradead.org>, X86 ML <x86@kernel.org>,
Linux MM <linux-mm@kvack.org>, Ingo Molnar <mingo@redhat.com>,
Mel Gorman <mgorman@suse.de>, "H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Logan Gunthorpe <logang@deltatee.com>
Subject: Re: [-mm PATCH v2 23/25] mm, x86: get_user_pages() for dax mappings
Date: Tue, 15 Dec 2015 18:18:20 -0800
Message-ID: <CAPcyv4hRmMJBBWr6dTjX05KFUE8sv6WQa0Co9h-ukHn=_8p6Ag@mail.gmail.com>
In-Reply-To: <20151215161438.e971fc9b98814513bbacb3ed@linux-foundation.org>
On Tue, Dec 15, 2015 at 4:14 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 09 Dec 2015 18:39:16 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
>> A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
>> has established a devm_memremap_pages() mapping, i.e. when the pfn_t
>> return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
>> encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
>> struct dev_pagemap instance to keep the result of pfn_to_page() valid
>> until put_page().
>
> This patch adds a whole bunch of code and cycles to everyone's kernels,
> but few of those kernels will ever use it. What are our options for
> reducing that overhead, presumably via Kconfig?
It does compile out when CONFIG_ZONE_DEVICE=n.
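
For reference, the compile-out falls out of the usual Kconfig stub
pattern; this mirrors how is_zone_device_page() is guarded in
include/linux/mm.h:

#ifdef CONFIG_ZONE_DEVICE
static inline bool is_zone_device_page(const struct page *page)
{
	return page_zonenum(page) == ZONE_DEVICE;
}
#else
static inline bool is_zone_device_page(const struct page *page)
{
	return false;
}
#endif

With CONFIG_ZONE_DEVICE=n the helper is constant false, so the new
branch added to get_page() is dead code the compiler eliminates.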
>> --- a/arch/x86/mm/gup.c
>> +++ b/arch/x86/mm/gup.c
>> @@ -63,6 +63,16 @@ retry:
>> #endif
>> }
>>
>> +static void undo_dev_pagemap(int *nr, int nr_start, struct page **pages)
>> +{
>> + while ((*nr) - nr_start) {
>> + struct page *page = pages[--(*nr)];
>> +
>> + ClearPageReferenced(page);
>> + put_page(page);
>> + }
>> +}
>
> PG_referenced is doing something magical in this code. Could we have a
> nice comment explaining its meaning in this context? Unless it's
> already there and I missed it..
In this case I'm just duplicating what gup_pte_range() already does
with normal memory pages; there's no special meaning for zone_device
pages.
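
For context, the relevant loop in arch/x86/mm/gup.c looks roughly
like the following (paraphrased from the existing gup_pte_range(),
not something this patch introduces):

	/* inside gup_pte_range(), once the pte has been validated */
	page = pte_page(pte);
	get_page(page);
	SetPageReferenced(page);	/* hint to reclaim: recently used */
	pages[*nr] = page;
	(*nr)++;

undo_dev_pagemap() simply reverses those steps, i.e.
ClearPageReferenced() + put_page(), when the fast walk has to unwind
partway through.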
>
>>
>> ...
>>
>> @@ -830,6 +831,20 @@ static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
>> percpu_ref_put(pgmap->ref);
>> }
>>
>> +static inline void get_page(struct page *page)
>> +{
>> + if (is_zone_device_page(page))
>> + percpu_ref_get(page->pgmap->ref);
>> +
>> + page = compound_head(page);
>
> So we're assuming that is_zone_device_page() against a tail page works
> OK. That's presently true, fingers crossed for the future...
>
> And we're also assuming that device pages are never compound. How safe
> is that assumption?
These pages are never put on a slab or LRU list and are never touched
by the mm outside of get_user_pages() and the dma-mapping I/O path.
It's a bug if a zone_device page is anything but a standalone vanilla
page.
>> + /*
>> + * Getting a normal page or the head of a compound page
>> + * requires to already have an elevated page->_count.
>> + */
>> + VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
>> + atomic_inc(&page->_count);
>> +}
>
> The core pagecache lookup bypasses get_page() by using
> page_cache_get_speculative(), but get_page() might be a hotpath for
> some workloads. Here we're adding quite a bit of text and math and a
> branch. I'm counting 157 callsites.
>
> So this is a rather unwelcome change. Why do we need to alter such a
> generic function as get_page() anyway? Is there some way to avoid
> altering it?
The minimum requirement is to pin down the driver while the pages it
allocated might be in active use for dma or other i/o. We could meet
this requirement by simply pinning the driver while any dax vma is
open.
The downside of this is that it blocks device driver unbind
indefinitely while a dax vma is established rather than only
temporarily blocking unbind/remove for the lifetime of an O_DIRECT I/O
operation.
This goes back to a recurring debate about whether pmem is "memory"
or a "storage device / disk":

pmem as memory => pin it active while any vma is established

pmem as storage device => pin it active only while i/o is in flight,
and forcibly revoke mappings on an unbind/remove event

A rough sketch of the first, coarse-grained option follows.
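
This is hypothetical code, not part of this series; it assumes the
dax vma stashes its dev_pagemap in vm_private_data:

static void dax_vm_open(struct vm_area_struct *vma)
{
	struct dev_pagemap *pgmap = vma->vm_private_data;

	/* block driver unbind for as long as this vma exists */
	percpu_ref_get(pgmap->ref);
}

static void dax_vm_close(struct vm_area_struct *vma)
{
	struct dev_pagemap *pgmap = vma->vm_private_data;

	/* last munmap() lets unbind/remove proceed again */
	percpu_ref_put(pgmap->ref);
}

static const struct vm_operations_struct dax_vm_ops = {
	.open = dax_vm_open,
	.close = dax_vm_close,
	/* .fault, etc. elided */
};

That trades the per-page reference in get_page()/put_page() for a
per-vma reference, at the cost of blocking unbind until the last
munmap().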
>> +static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
>> + pmd_t *pmd)
>> +{
>> + pmd_t _pmd;
>> +
>> + /*
>> + * We should set the dirty bit only for FOLL_WRITE but for now
>> + * the dirty bit in the pmd is meaningless. And if the dirty
>> + * bit will become meaningful and we'll only set it with
>> + * FOLL_WRITE, an atomic set_bit will be required on the pmd to
>> + * set the young bit, instead of the current set_pmd_at.
>> + */
>> + _pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
>> + if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
>> + pmd, _pmd, 1))
>> + update_mmu_cache_pmd(vma, addr, pmd);
>> +}
>> +
>> +struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>> + pmd_t *pmd, int flags)
>> +{
>> + unsigned long pfn = pmd_pfn(*pmd);
>> + struct mm_struct *mm = vma->vm_mm;
>> + struct dev_pagemap *pgmap;
>> + struct page *page;
>> +
>> + assert_spin_locked(pmd_lockptr(mm, pmd));
>> +
>> + if (flags & FOLL_WRITE && !pmd_write(*pmd))
>> + return NULL;
>> +
>> + if (pmd_present(*pmd) && pmd_devmap(*pmd))
>> + /* pass */;
>> + else
>> + return NULL;
>> +
>> + if (flags & FOLL_TOUCH)
>> + touch_pmd(vma, addr, pmd);
>> +
>> + /*
>> + * device mapped pages can only be returned if the
>> + * caller will manage the page reference count.
>> + */
>> + if (!(flags & FOLL_GET))
>> + return ERR_PTR(-EEXIST);
>> +
>> + pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
>> + pgmap = get_dev_pagemap(pfn, NULL);
>> + if (!pgmap)
>> + return ERR_PTR(-EFAULT);
>> + page = pfn_to_page(pfn);
>> + get_page(page);
>> + put_dev_pagemap(pgmap);
>> +
>> + return page;
>> +}
>
> hm, so device pages can be huge. How does this play with get_page()'s
> assumption that the pages cannot be compound?
We don't need to track the pages as a compound unit because they'll
never hit mm paths that care about the distinction. They are
effectively never "onlined".
> And again, this is bloating up the kernel for not-widely-used stuff.
I suspect the ability to compile it out is little comfort since we're
looking to get CONFIG_ZONE_DEVICE enabled by default in major distros.
If that's the case I'm willing to entertain the coarse pinning route.
We can always circle back for the finer grained option if a problem
arises, but let me know if CONFIG_ZONE_DEVICE=n was all you were
looking for...