From: Dave Airlie
Date: Wed, 29 Nov 2023 15:14:58 +1000
Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
To: Christian König
Cc: Weixi Zhu, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 akpm@linux-foundation.org, weixi.zhu@openeuler.sh, mgorman@suse.de,
 jglisse@redhat.com, rcampbell@nvidia.com, jhubbard@nvidia.com,
 apopple@nvidia.com, mhairgrove@nvidia.com, ziy@nvidia.com,
 alexander.deucher@amd.com, Xinhui.Pan@amd.com,
 amd-gfx@lists.freedesktop.org, Felix.Kuehling@amd.com, ogabbay@kernel.org,
 dri-devel@lists.freedesktop.org, jgg@nvidia.com, leonro@nvidia.com,
 zhenyuw@linux.intel.com, zhi.a.wang@intel.com,
 intel-gvt-dev@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 jani.nikula@linux.intel.com, joonas.lahtinen@linux.intel.com,
 rodrigo.vivi@intel.com, tvrtko.ursulin@linux.intel.com
References: <20231128125025.4449-1-weixi.zhu@huawei.com> <9308a79d-e312-4e6d-98fe-75dc6d0fbeda@amd.com>
In-Reply-To: <9308a79d-e312-4e6d-98fe-75dc6d0fbeda@amd.com>

On Tue, 28 Nov 2023 at 23:07, Christian König wrote:
>
> On 28.11.23 at 13:50, Weixi Zhu wrote:
> > The problem:
> >
> > Accelerator driver developers are forced to reinvent external MM subsystems
> > case by case, because Linux core MM only considers host memory resources.
> > These reinvented MM subsystems have similar orders of magnitude of LoC as
> > Linux MM (80K), e.g. Nvidia-UVM has 70K, AMD GPU has 14K and Huawei NPU has
> > 30K. Meanwhile, more and more vendors are implementing their own
> > accelerators, e.g. Microsoft's Maia 100. At the same time,
> > application-level developers suffer from poor programmability -- they must
> > consider parallel address spaces and be careful about the limited device
> > DRAM capacity. This can be alleviated if a malloc()-ed virtual address can
> > be shared by the accelerator, or the abundant host DRAM can further
> > transparently backup the device local memory.
> >
> > These external MM systems share similar mechanisms except for the
> > hardware-dependent part, so reinventing them is effectively introducing
> > redundant code (14K~70K for each case). Such developing/maintaining is not
> > cheap. Furthermore, to share a malloc()-ed virtual address, device drivers
> > need to deeply interact with Linux MM via low-level MM APIs, e.g. MMU
> > notifiers/HMM. This raises the bar for driver development, since developers
> > must understand how Linux MM works. Further, it creates code maintenance
> > problems -- any changes to Linux MM potentially require coordinated changes
> > to accelerator drivers using low-level MM APIs.
> >
> > Putting a cache-coherent bus between host and device will not make these
> > external MM subsystems disappear. For example, a throughput-oriented
> > accelerator will not tolerate executing heavy memory access workload with
> > a host MMU/IOMMU via a remote bus. Therefore, devices will still have
> > their own MMU and pick a simpler page table format for lower address
> > translation overhead, requiring external MM subsystems.
> >
> > --------------------
> >
> > What GMEM (Generalized Memory Management [1]) does:
> >
> > GMEM extends Linux MM to share its machine-independent MM code.
> > Only a high-level interface is provided for device drivers. This prevents
> > accelerator drivers from reinventing the wheel, but relies on drivers to
> > implement their hardware-dependent functions declared by GMEM. GMEM's key
> > interfaces include gm_dev_create(), gm_as_create(), gm_as_attach() and
> > gm_dev_register_physmem(). Here is a brief description of how a device
> > driver utilizes them:
> > 1. At boot time, call gm_dev_create() and register the implementations of
> >    the hardware-dependent functions declared in struct gm_mmu.
> >      - If the device has local DRAM, call gm_dev_register_physmem() to
> >        register available physical addresses.
> > 2. When a device context is initialized (e.g. triggered by ioctl), check if
> >    the current CPU process has been attached to a gmem address space
> >    (struct gm_as). If not, call gm_as_create() and point current->mm->gm_as
> >    to it.
> > 3. Call gm_as_attach() to attach the device context to a gmem address space.
> > 4. Invoke gm_dev_fault() to resolve a page fault or prepare data before
> >    device computation happens.
> >
> > GMEM has changed the following assumptions in Linux MM:
> >   1. An mm_struct not only handles a single CPU context, but may also handle
> >      external memory contexts encapsulated as gm_context listed in
> >      mm->gm_as. An external memory context can include a few or all of the
> >      following parts: an external MMU (that requires TLB invalidation), an
> >      external page table (that requires PTE manipulation) and external DRAM
> >      (that requires physical memory management).
>
> Well that is pretty much exactly what AMD has already proposed with KFD
> and was rejected for rather good reasons.
>
> > MMU functions
> > The MMU functions peer_map() and peer_unmap() overlap other functions,
> > leaving a question if the MMU functions should be decoupled as more basic
> > operations.
> > Decoupling them could potentially prevent device drivers from
> > coalescing these basic steps within a single host-device communication
> > operation, while coupling them makes it more difficult for device drivers
> > to utilize the GMEM interface.
>
> Well to be honest all of this sounds like history to me. We have already
> seen the same basic approach in KFD, HMM and to some extent in TTM as well.
>
> And all of them more or less failed. Why should this here be different?

Any info we have on why this has failed to work in the past would be
useful to provide. This is one of those cases where we may not have
documented the bad ideas to stop future developers from thinking they
are good.

I do think we would want more common code in this area, but I would
think we'd have it more on the driver infrastructure side than in the
core mm.

Dave.