linux-mm.kvack.org archive mirror
From: Dan Williams <dan.j.williams@intel.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>, John Hubbard <jhubbard@nvidia.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>
Subject: Re: [HMM v16 04/15] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2
Date: Mon, 16 Jan 2017 18:57:55 -0800
Message-ID: <CAPcyv4i3y5vsg4Jw46U6vgO_pmkcQ8XJ7NAzQ=0sokoHj3eGjQ@mail.gmail.com>
In-Reply-To: <20170117020032.GA2309@redhat.com>

On Mon, Jan 16, 2017 at 6:00 PM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Mon, Jan 16, 2017 at 04:58:24PM -0800, Dan Williams wrote:
>> On Mon, Jan 16, 2017 at 12:13 PM, Jerome Glisse <jglisse@redhat.com> wrote:
>> > On Mon, Jan 16, 2017 at 11:31:39AM -0800, Dan Williams wrote:
>> [..]
>> >> >> dev_pagemap is only meant for get_user_pages() to do lookups of ptes
>> >> >> with _PAGE_DEVMAP and take a reference against the hosting device..
>> >> >
>> >> > And I want to build on top of that to extend _PAGE_DEVMAP to support
>> >> > a new use case for unaddressable device memory.
>> >> >
>> >> >>
>> >> >> Why can't HMM use the typical vm_operations_struct fault path and push
>> >> >> more of these details to a driver rather than the core?
>> >> >
>> >> > Because the vm_operations_struct has nothing to do with the device.
>> >> > We are talking about regular vmas here. Think malloc, mmap, shared
>> >> > memory, ... not about mmap(/dev/thedevice, ...).
>> >> >
>> >> > So the vm_operations_struct is never under device control and we can
>> >> > not, nor want to, rely on that.
>> >>
>> >> Can you explain more what's behind that "can not, nor want to"
>> >> statement? It seems to me that any awkwardness of moving to a
>> >> standalone device file interface is less than maintaining a new /
>> >> parallel mm fault path through dev_pagemap.
>> >
>> > The whole point of HMM is to allow transparent use of the process
>> > address space on a device like a GPU. So it applies to any vma
>> > (vm_area_struct) that results from a usual mmap (ie any mmap, either
>> > PRIVATE or SHARED, as long as it is not an mmap of a device file).
>> >
>> > It means that an application can use malloc or the usual memory
>> > allocation primitives of its language (c++, rust, python, ...) and use
>> > the memory it gets from them directly with the device.
>>
>> So you need 100% support of all these mm paths for this hardware to be
>> useful at all? Does a separate device driver and a userspace helper
>> library get you something like 80% of the functionality, and then we
>> can debate the core mm changes to get the final 20%? Or am I just
>> completely off base with how people want to use this hardware?
>
> Can't do that. Think of a library that wants to use the GPU: you do not
> want to update every single program that uses that library, and the
> library gets its memory from the application. This is just one scenario.
> Then you have mmaped files, shared memory, ...
>
> A transparent address space is where the industry is moving, and sadly
> on some platforms (like Intel) we cannot rely on hardware to solve it
> for us.
>
>
>> > Devices like GPUs have a large pool of device memory that is not
>> > accessible by the CPU. This device memory has 10 times more bandwidth
>> > than system memory and better latency than PCIE. Hence, for the whole
>> > thing to make sense, you need to allow it to be used.
>> >
>> > For that you need to allow migration from system memory to device
>> > memory. Because you cannot rely on a special userspace allocator, you
>> > have to assume that the vma (vm_area_struct) is a regular one. So we
>> > are left with having struct page for the device memory to allow
>> > migration to work without requiring too many changes to the existing mm.
>> >
>> > Because device memory is not accessible by the CPU, you cannot allow
>> > anyone to pin it, and thus get_user_page* must trigger a migration
>> > back just as a CPU page fault would.
>> >
>> >
>> >> > So what we are looking for here is a struct page that can behave
>> >> > mostly like any other, except that we do not want to allow GUP to
>> >> > take a reference; that is almost exactly what ZONE_DEVICE already
>> >> > provides.
>> >> >
>> >> > So do you have any fundamental objections to this patchset? And if
>> >> > so, how do you propose I solve the problem I am trying to address?
>> >> > Because the hardware exists today, and without something like HMM
>> >> > we will not be able to support it.
>> >>
>> >> My pushback stems from it being a completely different use case for
>> >> devm_memremap_pages(), as evidenced by it growing from 4 arguments to
>> >> 9, and the ongoing maintenance overhead of understanding HMM
>> >> requirements when updating the pmem usage of ZONE_DEVICE.
>> >
>> > I would rather reuse something existing and modify it to support more
>> > use cases than try to add ZONE_DEVICE2 or ZONE_DEVICE_I_AM_DIFFERENT.
>> > I have made sure that my modifications to ZONE_DEVICE can be used
>> > without HMM. It is just a generic interface to support page faults and
>> > to allow tracking the last user of a device page. Both can be used
>> > independently of each other.
>> >
>> > To me the whole point of the kernel is trying to share infrastructure
>> > across as much hardware as possible, and I am doing just that. I do
>> > not think HMM should be blocked because something that used to be for
>> > one specific use case now supports two use cases. I am not breaking
>> > anything existing. Is it more work for you? Maybe, but at Red Hat we
>> > intend to support it for as long as it is needed, so you always have
>> > someone to talk to if you want to update ZONE_DEVICE.
>>
>> Sharing infrastructure should not come at the expense of type safety
>> and clear usage rules.
>
> And where exactly do I violate that?

It's hard to judge without a user. For example, we found that fs/dax.c
was violating block device lifetime rules, which we solved with
dax_map_atomic(), but that could not have been done without seeing both
sides of the interface.

...but as I say that, I'm aware that I don't have the background in
graphics memory management that I have in the block stack, so I can't
review the usages in the same depth.

>> For example, the pmem case, before exposing ZONE_DEVICE memory to other
>> parts of the kernel, introduced the pfn_t type to distinguish
>> DMA-capable pfns from other raw pfns. All programmatic ways of
>> discovering whether a pmem range can support DMA use this type and
>> explicit flags.
>
> I am protected from this because I do not allow GUP. GUP triggers a
> migration back to regular system memory.
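
For reference, the pfn_t distinction above is just a pfn plus flags, so
a consumer can tell whether a given pfn is device memory and whether it
has a struct page behind it. Abridged from include/linux/pfn_t.h,
quoting from memory:

#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))  /* device memory */
#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))  /* memmap exists */

static inline bool pfn_t_has_page(pfn_t pfn)
{
        /* device pfns only have struct pages when explicitly mapped */
        return (pfn.val & PFN_MAP) == PFN_MAP || (pfn.val & PFN_DEV) == 0;
}

So callers have to check the flags before treating a pfn as something
with a page behind it.
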
>
>>
>> While we may not need ZONE_DEVICE2 we obviously need a different
>> wrapper around arch_add_memory() than devm_memremap_pages() for HMM
>> and likely a different physical address radix than pgmap_radix because
>> they are servicing 2 distinct purposes. For example, I don't think HMM
>> should be using unmodified arch_add_memory(). We shouldn't add
>> unaddressable memory to the linear address mappings when we know there
>> is nothing behind it, especially when it seems all you need from
>> arch_add_memory() is pfn_to_page() to be valid.
>
> And my patchset does just that: I do not add the device pfns to the
> linear mapping because there is nothing there. In arch_add_memory(),
> x86, ppc and arm do barely more than set up the linear mapping and add
> struct pages. So instead of splitting this function in two, I just made
> the linear mapping conditional.

Sorry, I missed that.
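
So, to make sure I read it right, the shape is roughly the following.
This is only a sketch; the real flag name comes from your patch 1 and I
haven't gone back to check the exact spelling:

int arch_add_memory(int nid, u64 start, u64 size, int flags)
{
        /* unaddressable device memory gets struct pages, but no kernel
         * linear mapping since the CPU cannot touch it anyway */
        if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
                init_memory_mapping(start, start + size);

        /* argument list abridged; still create the memmap so that
         * pfn_to_page() works for the device range */
        return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT);
}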

> I can split HMM from devm_memremap_pages() and thus give it a separate
> pgmap_radix. You have to understand that this will not change most of
> my patchset.
>

Sure, but I think it would be worth it from a readability / maintenance
perspective. With HMM being a superset of the existing dev_pagemap
usage, it might make sense to just use struct dev_pagemap as a
sub-structure of the HMM data.
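
Concretely, something along these lines; the names here are just
placeholders to illustrate the containment:

struct hmm_devmem {
        struct dev_pagemap pagemap;     /* existing ZONE_DEVICE plumbing */
        /* ...HMM-only state: fault callback, device pointer, etc... */
};

/* recover the HMM data from a page's pgmap without a parallel radix */
static inline struct hmm_devmem *to_hmm_devmem(struct dev_pagemap *pgmap)
{
        return container_of(pgmap, struct hmm_devmem, pagemap);
}

That way the pmem path keeps using a bare dev_pagemap and HMM layers its
extra state on top, rather than teaching devm_memremap_pages() about
both use cases.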


Thread overview: 25+ messages
2017-01-12 16:30 [HMM v16 00/16] HMM (Heterogeneous Memory Management) v16 Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 01/15] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
2017-01-13 13:57   ` Balbir Singh
2017-01-13 14:45     ` Jerome Glisse
2017-01-12 16:30 ` [HMM v16 02/15] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory v2 Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 03/15] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 04/15] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2 Jérôme Glisse
2017-01-16  7:05   ` Dan Williams
2017-01-16 15:17     ` Jerome Glisse
2017-01-16 19:31       ` Dan Williams
2017-01-16 20:13         ` Jerome Glisse
2017-01-17  0:58           ` Dan Williams
2017-01-17  2:00             ` Jerome Glisse
2017-01-17  2:57               ` Dan Williams [this message]
2017-01-12 16:30 ` [HMM v16 05/15] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 06/15] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 07/15] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 08/15] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 09/15] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 10/15] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 11/15] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 12/15] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 13/15] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 14/15] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory v2 Jérôme Glisse
2017-01-12 16:30 ` [HMM v16 15/15] mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory Jérôme Glisse
