From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0315CCF318A for ; Tue, 1 Oct 2024 23:00:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 835CD440149; Tue, 1 Oct 2024 19:00:44 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7BD0768002B; Tue, 1 Oct 2024 19:00:44 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5E8F4440149; Tue, 1 Oct 2024 19:00:44 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 37EBF68002B for ; Tue, 1 Oct 2024 19:00:44 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id D096D1406A3 for ; Tue, 1 Oct 2024 23:00:43 +0000 (UTC) X-FDA: 82626554766.29.686DAA9 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) by imf02.hostedemail.com (Postfix) with ESMTP id DCC6B8000F for ; Tue, 1 Oct 2024 23:00:41 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=SSij8Gju; spf=pass (imf02.hostedemail.com: domain of 3GH_8ZgsKCEcjltn0un72wppxxpun.lxvurw36-vvt4jlt.x0p@flex--ackerleytng.bounces.google.com designates 209.85.128.201 as permitted sender) smtp.mailfrom=3GH_8ZgsKCEcjltn0un72wppxxpun.lxvurw36-vvt4jlt.x0p@flex--ackerleytng.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1727823576; a=rsa-sha256; cv=none; b=fuuDVKk9OAbZW2aHB7U7I2IiWCYoV6psvhDcIRJRw9FyZmBTtwRPdqGLaKVPTvxEsaSGyy WXKLduilIiKsUIMf5WMBR3z9M6GCqpXWvM8r2igNl/3kSy2p6SvZl32hmTIksgiGXyUVNL mxciJvweUDUha4uvGT733snaLXaY3Ic= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=SSij8Gju; spf=pass (imf02.hostedemail.com: domain of 3GH_8ZgsKCEcjltn0un72wppxxpun.lxvurw36-vvt4jlt.x0p@flex--ackerleytng.bounces.google.com designates 209.85.128.201 as permitted sender) smtp.mailfrom=3GH_8ZgsKCEcjltn0un72wppxxpun.lxvurw36-vvt4jlt.x0p@flex--ackerleytng.bounces.google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1727823576; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:dkim-signature; bh=4gCtU5HfgnWFhWLB0iKDEWMiCCXFJXlISNEaZ8KzdwM=; b=D2L0cQa+EybnmuvAsEAjWuRx2a3mdiBR4WTlMpJcNhQldSPudJbcpXLv0xsv7jaidWSgrJ 9BTV1xmOhetFTO/U4a6uGMIXIHiY/ABn3N5I0VEuVZEUXg6ORMyOC5mgbKZNsdho8tkYID 4Nc2vWwO/bZUqVylNvMe2C8KoC2hI/s= Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-6e25a6fdde8so52592167b3.2 for ; Tue, 01 Oct 2024 16:00:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1727823641; x=1728428441; darn=kvack.org; h=content-transfer-encoding:cc:to:from:subject:message-id :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=4gCtU5HfgnWFhWLB0iKDEWMiCCXFJXlISNEaZ8KzdwM=; b=SSij8GjuEBP/8dN8NVmHthk57/NSUweeQGgH2MVxDNfpFL870Y8E7Ls69qg8XlXFZ+ O8dQcEUiNlwGhBBMmqWdHWLU5/weuC71VjaIFnXVHLHEqVqJbWhYu2eLmuSbVGNtRkBy ao/Gf2CsQN6WmefJ7uFyHEaP7CiCMzdo7pm+xerHtSxO+zsblABRHVAMwVCsxx/AgQgm 8q8vrHqETvRl73s3HDZvhoLbu0GmtGOlCkWtRr3CMo+8zKq7lrlxMDP7hb+f4spHgJM3 8Sy7zhZc+ydcldMUaL7LaU0Me76iB/dgKluEC74AGIt42zfzexrEaXTuBvsfuXCd+BRC U+UA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1727823641; x=1728428441; h=content-transfer-encoding:cc:to:from:subject:message-id :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=4gCtU5HfgnWFhWLB0iKDEWMiCCXFJXlISNEaZ8KzdwM=; b=bkAI9rxZIXK4Ijm6aNEmllEUpwPjvJ09PDQIsGMUIA2n7xcwz/9fGf7PqZFlB55qPu PGUue2RX1b+n9XkM1y0dZLFYBkcmUkc2G54yEwDy5YxvG7E7+VCYky+dkjnfCj7HM2/E eeuop5ERMvJTw4lxzZQk9bPExaObnmol701jl/CIIeZ3pKtIZOUyUsexmMp96GUlkigT IPEopthy/kfUewPn3zbs4y8T0/y0QSC4p1lVjGmDO1U8vuyZxueVYkorLK/KmolkRE3l 6qjWtxgtQ7mzYfIceyB2g6W5Gp3ihVGM6SeHuLFo8tvl9mJWPzjoM2chLSuUypcsqvt2 OUtg== X-Forwarded-Encrypted: i=1; AJvYcCWOrqPOfbYiLMtXqRRtDLBrTyZJGNft1Yzm3v5zW9lab3I57vBYLC7I9sTrLzq9do6JtbnKUL3G0w==@kvack.org X-Gm-Message-State: AOJu0Yzgfc1m6fYQNpAP37gxbGgGMj02I4I/PpwaT3i6uEp8rjCTNHPV lwUgDrJ6b2SJ1YnDW/RFikKPu/HDYKrp6TZVFfIclrrLv1n/O4Q5M5Zs3S/Z8Y+HToh0BCxg1x8 4VApBQbq2hf2kdwbht7rmyw== X-Google-Smtp-Source: AGHT+IG3j6THonxAJt6TEJKN7NO5HT8bgSkq/EkXQdPd0/DIlLt9YIjJDCnSMc65fW4q5aqiWvBO7ZRe6vvgfxZcVw== X-Received: from ackerleytng-ctop.c.googlers.com ([fda3:e722:ac3:cc00:146:b875:ac13:a9fc]) (user=ackerleytng job=sendgmr) by 2002:a05:690c:6001:b0:6e2:12e5:356f with SMTP id 00721157ae682-6e2a2e48277mr357187b3.3.1727823640635; Tue, 01 Oct 2024 16:00:40 -0700 (PDT) Date: Tue, 01 Oct 2024 23:00:38 +0000 In-Reply-To: (message from Vishal Annapurve on Fri, 20 Sep 2024 11:17:45 +0200) Mime-Version: 1.0 Message-ID: Subject: Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup From: Ackerley Tng To: Vishal Annapurve Cc: tabba@google.com, quic_eberman@quicinc.com, roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com, david@redhat.com, rientjes@google.com, fvdl@google.com, jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com, zhiquan1.li@intel.com, fan.du@intel.com, jun.miao@intel.com, isaku.yamahata@intel.com, muchun.song@linux.dev, mike.kravetz@oracle.com, erdemaktas@google.com, qperret@google.com, jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org, brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev, pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com, anup@brainfault.org, haibo1.xu@intel.com, ajones@ventanamicro.com, vkuznets@redhat.com, maciej.wieczor-retman@intel.com, pgonda@google.com, oliver.upton@linux.dev, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kvm@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: dsdpj4fs57swttwq4ksh3jsxbxpzg1e6 X-Rspamd-Queue-Id: DCC6B8000F X-Rspam-User: X-Rspamd-Server: rspam10 X-HE-Tag: 1727823641-921248 X-HE-Meta: U2FsdGVkX1/7dDjR8+GxuHyzCmexfpFdn37WguLLH1qW1Vprknp2vnfQ35yEfxASAnTC4GCZM+qbsgCfx+B/Z0+Hq4p6/kX8RaQY4IzMy3kZvJAw6GJ/vkt+dz6BO5N+XcrybeX554t84xKTAypgAQxiwj1NL8cwsW5wO46+yUk1/yCW7yK1oK8pebxeWZINLwAQXvrexD5OEKCbuWhlzGRsFZwxyb4gw64Vi+s1NSl2FHh8mNC3hEDZFWwWObK/+bu5mFCRBE8YSJGtH072FVpan3+FQyEiqikuwLLt1kaSEWbgg+rQBChVSLOkfvjF75h7+quMNxvl/nGK1Fn81In53FwxLi1lSHrAMeb8P8N3wK6XYEbxj1lAccSkXmeVd8hB8K4ME9dPM8GK1XefJHAGeUHF5phw/GFQXrRs/lHQJYSFGMmjnh0bYBvcnFvp2hvr4S/s5vGI0AaIL59AQkku251bhcaqBaNvBc9ff4ZXD16eKVpjT2DO5LGhfhWQ6UBpw3erH5+NmfwQfzKJdwtM1VisPdEI/SAjxomeMUL1NCWwR0F+Vail74sD8Xn2IJtcnmKi/Ni3LNgx96Ab4EQzb2BWgQ0+a3CJYcA+G1eQzVj9L1N+JCg8w8J2v0U4NOyx4xH3dN2p4KL4IUz6gmp1JJ43e6n3r+KgYIixW0wJaz/PLh0XUNTHumlhDVUH6IL0Xe5nKIMragRCtRdtOreex4E8EN/QGGc3DcMJeJgwRpX8m1KMfGZG1gMSmnjuiQaVOFlnxznAzv8htO3vuVjT1/SsqXh4F+JLwEsvkHqotKEVR118eYPbfQp0P9yiAR9S7DPB0xVkszDGyrMXb5BTQI4buaSiEeLhl0u3V65f2O33n21I9MNv33/DqU6u9PnYT0gsUv/TzmEEzFSLx9MigNaNC2HIMTnHRjPezdZmt+RGkvCIpB7Sk8Jq96YnOg38akELrvBHT0uCOVU K5ENG7tz HWdngWKKxQ43GP+6cR7GkD9L9XrzsNVcMLAAYeLIRD1NXmPxW+I/6iPujM3bGWgxI+yulSmGfy98sVXleyg1l5/1+jtV7gpluPEu816xUOcGVkyubW5iwDCC/sdt/5cHvuVbcqoQnFy5wUAOOtlwu4p7xUQv+YZPabjqFIR3bXvJ3CZOcp9X4jXMF96V+Vkps95NgMiKy74A9ZDOiJxGwG4PEGsRgjJTEvHZcWTfGsnothB254DAAhB1GxM5RLHnuQ3afHfJATSud+Sf5BeNpEL64DZjCEQOoXOQIqfwQdl3CTxTLXDBoDj7B/jHwix/4zTErlyOU+FjPypxX26INSLd6VUTyh+4/TsJKOI2F0z1GDGKpTK0OiWc53QIr7oaO4WbpQi/f8Nq0ZYb+0P6EK2B6keJ/uDklSkU23ADZ0zQ62Ai/OwMtWSl4gYRedop8jXJhKt1X4uOtArm9aRTQAIysvUeNso5SJIVn4xRTj88BxRjDdVYL9Tn6uJl5HLIROqQjrZgYewnSWMqO8GM3Nw0QJfPl5FyWk+ci1M+8DgmhQmTAew6KYq1xX3zI+etEVvI8tNqUfLtfMmwVTKK9BYjKJK0rtSUqiekg56hjpUTGc2HiJDQvNxfmAovSRgFdE+ukKOcWcpNzSV0VI7Bc3B+UPlwMdH5/El6BMeawj49MlwaFqSLQM53R7vC5VccGCGq/GyiGTR5att5j2lIo8GWdiUnDM9+ViPjpLidgaL4mKdY+6b8U8s7XprNHrTdcARGniu/FRBdF/pllZx91z9QUhjMoUccGauRg X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Vishal Annapurve writes: > On Wed, Sep 11, 2024 at 1:44=E2=80=AFAM Ackerley Tng wrote: >> >> ... >> +} >> + >> +static void kvm_gmem_evict_inode(struct inode *inode) >> +{ >> + u64 flags =3D (u64)inode->i_private; >> + >> + if (flags & KVM_GUEST_MEMFD_HUGETLB) >> + kvm_gmem_hugetlb_teardown(inode); >> + else >> + truncate_inode_pages_final(inode->i_mapping); >> + >> + clear_inode(inode); >> +} >> + >> static const struct super_operations kvm_gmem_super_operations =3D { >> .statfs =3D simple_statfs, >> + .evict_inode =3D kvm_gmem_evict_inode, > > Ackerley, can we use free_inode[1] callback to free any special > metadata associated with the inode instead of relying on > super_operations? > > [1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719 > .free_inode() is not a direct replacement for .evict_inode(). If the .free_inode() op is NULL, free_inode_nonrcu(inode) handles freeing t= he struct inode itself. Hence, the .free_inode() op is meant for freeing the i= node struct. .free_inode() should undo what .alloc_inode() does. There's more information about the ops free_inode() here https://docs.kernel.org/filesystems/porting.html, specifically | Rules for inode destruction: | | + if ->destroy_inode() is non-NULL, it gets called | + if ->free_inode() is non-NULL, it gets scheduled by call_rcu() | + combination of NULL ->destroy_inode and NULL ->free_inode is treated as | NULL/free_inode_nonrcu, to preserve the compatibility. The common setup is to have a larger containing struct containing a struct inode, and the .free_inode() op will then free the larger struct. In our ca= se, we're not using a containing struct for the metadata, so .free_inode() isn'= t the appropriate op. I think this question might be related to Sean's question at LPC about whet= her it is necessary for guest_memfd to have its own mount, as opposed to using = the anon_inode_mnt. I believe having its own mount is the correct approach, my reasoning is as follows 1. We want to clean up these inode metadata when the last reference to the = inode is dropped 2. That means using some op on the iput_final() path. 3. All the ops on the iput_final() path are in struct super_operations, whi= ch is part of struct super_block 4. struct super_block should be used together with a mount Hence, I think it is correct to have a guest_memfd mount. I guess it might = be possible to have a static super_block without a mount, but that seems hacky= and brittle, and I'm not aware of any precedent for a static super_block. Sean, what are your concerns with having a guest_memfd mount? Comparing the callbacks along the iput_final() path, we have these: + .drop_inode() determines whether to evict the inode, so that's not the approprate op. + .evict_inode() is the current proposal, which is a place where the inode'= s fields are cleaned up. HugeTLB uses this to clean up resv_map, which it a= lso stores in inode->i_mapping->i_private_data. + .destroy_inode() should clean up inode allocation if inode allocation inv= olves a containing struct (like shmem_inode_info). Shmem uses this to clean up = a struct shared_policy, which we will eventually need to store as well. + .free_inode() is the rcu-delayed part that completes inode cleanup. Using .free_inode() implies using a containing struct to follow the convention. Between putting metadata in a containing struct and using inode->i_mapping->i_private_data, I think using inode->i_mapping->i_private= _data is less complex since it avoids needing a custom .alloc_inode() op. Other than using inode->i_mapping->i_private_data, there's the option of combining the metadata with guest_memfd flags, and storing everything in inode->i_private. Because inode->i_mapping actually points to inode->i_data and i_data is a part of the inode (not a pointer), .evict_inode() is still the op to use to clean both inode->i_mapping->i_private_data and inode->i_private. I think we should stick with storing metadata (faultability xarray and huge= tlb pool reference) in inode->i_mapping->i_private_data because both of these a= re properties of the page cache/filemap. When we need to store a memory policy, we might want to use .destroy_inode(= ) to align with shmem. What do you all think? And there's no way to set inode->free_inode directly and skip copying from inode->i_sb->s_op. All the code paths going to i_callback() copy inode->i_sb->s_op->free_inode to inode->free_inode before calling .free_inode() in i_callback() to complete the inode cleanup.