Date: Mon, 3 Aug 2015 14:28:54 -0400
From: Jerome Glisse
To: Joerg Roedel
Cc: ksummit-discuss@lists.linuxfoundation.org
Subject: Re: [Ksummit-discuss] [CORE TOPIC] Core Kernel support for Compute-Offload Devices
Message-ID: <20150803182853.GB2981@gmail.com>
In-Reply-To: <20150803160203.GJ14980@8bytes.org>
References: <20150730130027.GA14980@8bytes.org> <55BB8BB2.2090809@redhat.com> <20150731161304.GA2039@redhat.com> <20150801155728.GC14980@8bytes.org> <20150801190847.GA2704@gmail.com> <20150803160203.GJ14980@8bytes.org>

On Mon, Aug 03, 2015 at 06:02:03PM +0200, Joerg Roedel wrote:
> Hi Jerome,
>
> On Sat, Aug 01, 2015 at 03:08:48PM -0400, Jerome Glisse wrote:
> > It is definitely worth a discussion, but I fear right now there is
> > little room for anything in the kernel. Hardware scheduling is done
> > almost 100% in hardware. The idea of a GPU is that you have 1000
> > compute units, but the hardware keeps track of 10000 threads, and at
> > any point in time there is a high probability that 1000 of those
> > 10000 threads are ready to compute something. So if a job is only
> > using 60% of the GPU, the remaining 40% is automatically used by the
> > next batch of threads. This is a simplification, as the number of
> > threads the hw can keep track of depends on several factors and
> > varies from one model to the next, even within the same family from
> > the same manufacturer.
>
> So the hardware schedules individual threads, that is right. But still,
> as you say, there are limits on how many threads the hardware can
> handle, which the device driver needs to take care of, and it has to
> decide which job will be sent to the offload device next. Same with the
> priorities for the queues.

What I was pointing to is that right now you do not have that granularity
of choice from the device driver's point of view. Right now it is either
let a command queue spawn threads or not, so it is either stop a command
queue or let it run. How and when you can stop a queue varies, though. On
some hw you can only stop it at an execution boundary: if a packet in a
command queue requests 500k threads to be launched, you can only stop
that queue once the 500k threads have been launched; you cannot stop it
in the middle. Given that some of those queues are programmed directly
from userspace, you cannot even force the queue to schedule only small
batches of threads (something like no more than 1000 threads per command
packet in the queue). Newer hw is becoming more capable on that front,
though.
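To make the "execution boundary" point concrete, here is a toy C sketch.
The packet layout, field names and queue structure are entirely made up
for illustration (no real hardware or driver uses exactly this); the only
point is that a single packet asks for a whole grid of threads and the
driver's only control point is between packets:

/*
 * Toy sketch, made-up packet layout: one packet in a user-mapped queue
 * launches a whole grid of threads, and all the driver can do is stop
 * the queue between packets.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct dispatch_packet {		/* hypothetical, illustration only */
	uint32_t thread_count;		/* e.g. 500k threads from one packet */
	uint64_t kernel_addr;		/* device code to run */
	uint64_t args_addr;		/* kernel arguments */
};

struct user_queue {			/* hypothetical user-mapped ring */
	struct dispatch_packet *ring;
	uint32_t size;
	uint32_t write_idx;		/* advanced by userspace, not the driver */
	bool stop_requested;		/* the driver's only knob */
};

/*
 * The driver can ask the queue to stop, but that only takes effect when
 * the hardware fetches the *next* packet; the in-flight packet's threads
 * all get launched regardless.
 */
static void queue_stop_at_packet_boundary(struct user_queue *q)
{
	q->stop_requested = true;
}

int main(void)
{
	struct dispatch_packet pkt = {
		.thread_count = 500000,
		.kernel_addr = 0x1000,
		.args_addr = 0x2000,
	};
	struct user_queue q = { .ring = &pkt, .size = 1 };

	queue_stop_at_packet_boundary(&q);
	printf("stop requested, but the in-flight packet still launches %u threads\n",
	       pkt.thread_count);
	return 0;
}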
> > > Some devices might provide that information, see the extended-access
> > > bit of Intel VT-d.
> >
> > This would be limited to integrated GPUs and so far only on one
> > platform. My point was more that userspace has far more information to
> > make a good decision here. The userspace program is more likely to
> > know which part of the dataset is going to be repeatedly accessed by
> > the GPU threads.
>
> Hmm, so what is the point of HMM then? If userspace is going to decide
> which part of the address space the device needs, it could just copy the
> data over (keeping the address space layout and thus the pointers
> stable) and you would basically achieve the same without adding a lot of
> code to memory management, no?

Well no, you cannot be "transparent" if you do it in userspace. Say
userspace decides to migrate; that means the CPU can no longer access
that memory, so you have to either PROT_NONE the range or unmap it. This
is not what we want. If the CPU accesses memory that has been migrated to
device memory, we want to migrate it back (at the very least one page of
it) so the CPU can access it, and we want that migration back to be
transparent from the process's point of view, as if the memory had been
swapped out to disk (see the small sketch in the P.S. below). Even on hw
where the CPU can access device memory properly (maintaining CPU atomic
operations, for instance, which is not the case over PCIe), like with
CAPI on powerpc, you either need struct pages for the device memory or
the kernel must know how to handle those special ranges of memory.

So HMM never makes any decision itself; it leaves that to the device
driver, which can gather more information from the hw and from userspace
to make the best decision. The driver might still get things wrong, or
the userspace program might do something stupid like accessing the data
set with the CPU while the GPU is churning on it. Still, we do not want
CPU access to be handled as a fault or to be forbidden; when this happens
HMM will force a migration back in order to service the CPU page fault.

HMM also intends to provide features that are not doable from userspace,
like exclusive write access to a range for the device so that the device
can perform atomic operations. Again, PCIe offers only limited atomic
capabilities, so the only way to provide more advanced atomic operations
is to map the range read-only for the CPU and other devices while the
atomic operation on the device is in progress.

Another feature is sharing device memory between different devices. Some
devices (not necessarily from the same manufacturer) can communicate with
one another and access one another's device memory. When a range is
migrated to one device of such a pair, there must be a way for the other
device to find out about it. Having userspace device drivers try to
exchange that kind of information is racy in many ways, so it is easier
and better to have it in the kernel.

Cheers,
Jérôme
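P.S.: to make the transparency argument concrete, here is a rough
userspace-only sketch using plain POSIX calls (the device side is only
pretended, no HMM involved): the range has to be PROT_NONE'd while the
device owns it, and a SIGSEGV handler has to stand in for the "migrate
back on CPU fault" that HMM would do transparently in the kernel.

/*
 * Userspace-only "migration" sketch, plain POSIX, illustration only:
 * the copy to the device and the copy back are just pretended.
 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RANGE_SIZE (1UL << 20)	/* 1MB "data set" handed to the device */

static long page_size;

/* Stand-in for copying one page back from device memory. */
static void migrate_back(void *addr)
{
	void *page = (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));

	/* A real scheme would copy the data back here before unprotecting. */
	mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	migrate_back(info->si_addr);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = segv_handler,
		.sa_flags = SA_SIGINFO,
	};
	char *range;

	page_size = sysconf(_SC_PAGESIZE);
	sigaction(SIGSEGV, &sa, NULL);

	range = mmap(NULL, RANGE_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(range, 0, RANGE_SIZE);

	/* "Migrate" to the device: the CPU must lose access to the range. */
	mprotect(range, RANGE_SIZE, PROT_NONE);

	/* Any CPU touch now faults into the handler above. */
	range[0] = 42;
	printf("CPU access only worked because the SIGSEGV handler stepped in\n");
	return 0;
}

Even this toy version only covers the CPU side; nothing here tells the
device that the CPU took a page back, which is exactly the kind of
coordination that has to live in the kernel.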