References: <20231128125025.4449-1-weixi.zhu@huawei.com> <20231128125025.4449-3-weixi.zhu@huawei.com>
From: Pedro Falcato
Date: Sat, 2 Dec 2023 14:50:37 +0000
Subject: Re: [RFC PATCH 2/6] mm/gmem: add arch-independent abstraction to track address mapping status
To: David Hildenbrand
Cc: Weixi Zhu, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 akpm@linux-foundation.org, weixi.zhu@openeuler.sh, mgorman@suse.de,
 jglisse@redhat.com, rcampbell@nvidia.com, jhubbard@nvidia.com,
 apopple@nvidia.com, mhairgrove@nvidia.com, ziy@nvidia.com,
 alexander.deucher@amd.com, christian.koenig@amd.com, Xinhui.Pan@amd.com,
 amd-gfx@lists.freedesktop.org, Felix.Kuehling@amd.com, ogabbay@kernel.org,
 dri-devel@lists.freedesktop.org, jgg@nvidia.com, leonro@nvidia.com,
 zhenyuw@linux.intel.com, zhi.a.wang@intel.com,
 intel-gvt-dev@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 jani.nikula@linux.intel.com, joonas.lahtinen@linux.intel.com,
 rodrigo.vivi@intel.com, tvrtko.ursulin@linux.intel.com

On Fri, Dec 1, 2023 at 9:23 AM David Hildenbrand wrote:
>
> On 28.11.23 13:50, Weixi Zhu wrote:
> > This patch adds an abstraction layer, struct vm_object, that maintains
> > per-process virtual-to-physical mapping status stored in struct gm_mapping.
> > For example, a virtual page may be mapped to a CPU physical page or to a
> > device physical page. Struct vm_object effectively maintains an
> > arch-independent page table, which is defined as a "logical page table".
> > While arch-dependent page table used by a real MMU is named a "physical
> > page table". The logical page table is useful if Linux core MM is extended
> > to handle a unified virtual address space with external accelerators using
> > customized MMUs.
>
> Which raises the question why we are dealing with anonymous memory at
> all? Why not go for shmem if you are already only special-casing VMAs
> with a MMAP flag right now?
>
> That would maybe avoid having to introduce controversial BSD design
> concepts into Linux, that feel like going a step backwards in time to me
> and adding *more* MM complexity.
>
> >
> > In this patch, struct vm_object utilizes a radix
> > tree (xarray) to track where a virtual page is mapped to. This adds extra
> > memory consumption from xarray, but provides a nice abstraction to isolate
> > mapping status from the machine-dependent layer (PTEs). Besides supporting
> > accelerators with external MMUs, struct vm_object is planned to further
> > union with i_pages in struct address_mapping for file-backed memory.
>
> A file already has a tree structure (pagecache) to manage the pages that
> are theoretically mapped. It's easy to translate from a VMA to a page
> inside that tree structure that is currently not present in page tables.
>
> Why the need for that tree structure if you can just remove anon memory
> from the picture?
>
> >
> > The idea of struct vm_object is originated from FreeBSD VM design, which
> > provides a unified abstraction for anonymous memory, file-backed memory,
> > page cache and etc[1].
>
> :/
>
> > Currently, Linux utilizes a set of hierarchical page walk functions to
> > abstract page table manipulations of different CPU architecture.
> > The problem happens when a device wants to reuse Linux MM code to manage
> > its page table -- the device page table may not be accessible to the CPU.
> > Existing solution like Linux HMM utilizes the MMU notifier mechanisms to
> > invoke device-specific MMU functions, but relies on encoding the mapping
> > status on the CPU page table entries. This entangles machine-independent
> > code with machine-dependent code, and also brings unnecessary restrictions.
>
> Why? we have primitives to walk arch page tables in a non-arch specific
> fashion and are using them all over the place.
>
> We even have various mechanisms to map something into the page tables
> and get the CPU to fault on it, as if it is inaccessible (PROT_NONE as
> used for NUMA balancing, fake swap entries).
>
> > The PTE size and format vary arch by arch, which harms the extensibility.
>
> Not really.
>
> We might have some features limited to some architectures because of the
> lack of PTE bits. And usually the problem is that people don't care
> enough about enabling these features on older architectures.
>
> If we ever *really* need more space for sw-defined data, it would be
> possible to allocate auxiliary data for page tables only where required
> (where the features apply), instead of crafting a completely new,
> auxiliary datastructure with it's own locking.
>
> So far it was not required to enable the feature we need on the
> architectures we care about.
>
> >
> > [1] https://docs.freebsd.org/en/articles/vm-design/
>
> In the cover letter you have:
>
> "The future plan of logical page table is to provide a generic
> abstraction layer that support common anonymous memory (I am looking at
> you, transparent huge pages) and file-backed memory."
>
> Which I doubt will happen; there is little interest in making anonymous
> memory management slower, more serialized, and wasting more memory on
> metadata.

Also worth noting that:

1) Mach VM (which FreeBSD inherited, from the old BSD) vm_objects aren't
quite what's being stated here; rather, they are more or less replacements
for both anon_vma and address_space[1]. Very similarly to Linux, they take
pages from vm_objects and map them in page tables using pmap (the big
difference is anon memory, which has its bookkeeping in page tables on
Linux).

2) These vm_objects were a horrendous mistake (see CoW chaining) and
FreeBSD has to go to horrendous lengths to make them tolerable. The UVM
paper/dissertation (by Charles Cranor) talks about these issues at
length, and 20 years later it's still true.

3) Despite Linux MM having its warts, it's probably correct to consider
it a solid improvement over FreeBSD MM or NetBSD UVM.

And, finally, randomly tacking on core MM concepts from other systems
is at best a *really weird* idea. Particularly when they aren't even
what was stated!

[1] If you really can't use PTEs, I don't see how you can't use file
mappings and/or some vm_operations_struct workarounds, when the
patch's vm_object is literally just an xarray with a different name

-- 
Pedro