From: Vishal Annapurve
Date: Thu, 3 Nov 2022 21:57:11 +0530
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: "Kirill A . Shutemov"
Cc: Sean Christopherson, Chao Peng, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
 Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
 "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
 Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
 "Maciej S . Szmigiero", Vlastimil Babka, Yu Zhang, luto@kernel.org,
 jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
 david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
 dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com,
 Muchun Song, wei.w.wang@intel.com
In-Reply-To: <20221024145928.66uehsokp7bpa2st@box.shutemov.name>
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
 <20221021134711.GA3607894@chaop.bj.intel.com>
 <20221024145928.66uehsokp7bpa2st@box.shutemov.name>

On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the whole
> > private-fd thing was first being proposed.
> >
> > : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. If
> > : we want to support them for TDX private memory, we either need TDX private
> > : memory to have an HVA or we need file-based equivalents. Arguably we should add
> > : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
> > : even for tools like numactl.
>
> Yeah, we definitely have gaps in API wrt NUMA, but I don't think it needs
> to be addressed in the initial submission.
>
> BTW, it is not a regression compared to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
>        The specified policy will be ignored for any MAP_SHARED mappings in the
>        specified memory range. Rather the pages will be allocated according to
>        the memory policy of the thread that caused the page to be allocated.
>        Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis on mempolicy with shmem seems to be right. set_policy
on a virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.

Maybe having a way to set NUMA policy per-range in the inode would be
on par with what we can do today via mbind on virtual address ranges.

> fmove_pages(2) should be relatively straightforward, since it is
> best-effort and does not guarantee that the page will not be moved
> somewhere else just after return from the syscall.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov