Date: Fri, 1 Sep 2023 11:45:43 +0800
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
From: Binbin Wu
To: Ackerley Tng
Cc: seanjc@google.com, kvm@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
 linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org,
 pbonzini@redhat.com, maz@kernel.org, oliver.upton@linux.dev,
 chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
 paul.walmsley@sifive.com, palmer@dabbelt.com, aou@eecs.berkeley.edu,
 willy@infradead.org, akpm@linux-foundation.org, paul@paul-moore.com,
 jmorris@namei.org, serge@hallyn.com, chao.p.peng@linux.intel.com,
 tabba@google.com, jarkko@kernel.org, yu.c.zhang@linux.intel.com,
 vannapurve@google.com, mail@maciej.szmigiero.name, vbabka@suse.cz,
 david@redhat.com, qperret@google.com, michael.roth@amd.com,
 wei.w.wang@intel.com, liam.merwick@oracle.com,
 isaku.yamahata@gmail.com, kirill.shutemov@linux.intel.com

On 8/31/2023 12:44 AM, Ackerley Tng wrote:
> Binbin Wu writes:
>
>>>
>>>
>>> +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>>> +{
>>> +	struct address_space *mapping = inode->i_mapping;
>>> +	pgoff_t start, index, end;
>>> +	int r;
>>> +
>>> +	/* Dedicated guest is immutable by default. */
>>> +	if (offset + len > i_size_read(inode))
>>> +		return -EINVAL;
>>> +
>>> +	filemap_invalidate_lock_shared(mapping);
>>> +
>>> +	start = offset >> PAGE_SHIFT;
>>> +	end = (offset + len) >> PAGE_SHIFT;
>>> +
>>> +	r = 0;
>>> +	for (index = start; index < end; ) {
>>> +		struct folio *folio;
>>> +
>>> +		if (signal_pending(current)) {
>>> +			r = -EINTR;
>>> +			break;
>>> +		}
>>> +
>>> +		folio = kvm_gmem_get_folio(inode, index);
>>> +		if (!folio) {
>>> +			r = -ENOMEM;
>>> +			break;
>>> +		}
>>> +
>>> +		index = folio_next_index(folio);
>>> +
>>> +		folio_unlock(folio);
>>> +		folio_put(folio);
>> May be a dumb question, why we get the folio and then put it immediately?
>> Will it make the folio be released back to the page allocator?
>>
> I was wondering this too, but it is correct.
>
> In filemap_grab_folio(), the refcount is incremented in three places:
>
> + When the folio is created in filemap_alloc_folio(), it is given a
>   refcount of 1 in
>
>     filemap_alloc_folio() -> folio_alloc() -> __folio_alloc_node() ->
>     __folio_alloc() -> __alloc_pages() -> get_page_from_freelist() ->
>     prep_new_page() -> post_alloc_hook() -> set_page_refcounted()
>
> + Then, in filemap_add_folio(), the refcount is incremented twice:
>
>   + The first is from the filemap (1 refcount per page if this is a
>     hugepage):
>
>       filemap_add_folio() -> __filemap_add_folio() -> folio_ref_add()
>
>   + The second is a refcount from the lru list
>
>       filemap_add_folio() -> folio_add_lru() -> folio_get() ->
>       folio_ref_inc()
>
> In the other path, if the folio exists in the page cache (filemap), the
> refcount is also incremented through
>
>     filemap_grab_folio() -> __filemap_get_folio() -> filemap_get_entry()
>     -> folio_try_get_rcu()
>
> I believe all the branches in kvm_gmem_get_folio() are taking a refcount
> on the folio while the kernel does some work on the folio like clearing
> the folio in clear_highpage() or getting the next index, and then when
> done, the kernel does folio_put().
>
> This pattern is also used in shmem and hugetlb. :)

Thanks for your explanation. It helps a lot.

>
> I'm not sure whose refcount the folio_put() in kvm_gmem_allocate() is
> dropping though:
>
> + The refcount for the filemap depends on whether this is a hugepage or
>   not, but folio_put() strictly drops a refcount of 1.
> + The refcount for the lru list is just 1, but doesn't the page still
>   remain in the lru list?

I guess the refcount dropped here is the one taken at the fresh
allocation. Now that the filemap holds its own references to the folio,
is the folio's lifetime decided by the filemap/inode?

>
>>> +
>>> +		/* 64-bit only, wrapping the index should be impossible. */
>>> +		if (WARN_ON_ONCE(!index))
>>> +			break;
>>> +
>>> +		cond_resched();
>>> +	}
>>> +
>>> +	filemap_invalidate_unlock_shared(mapping);
>>> +
>>> +	return r;
>>> +}
>>> +
>>>
>>>
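
To make the accounting in my guess concrete, here is a toy user-space
model of it. This is only a sketch, not kernel code: struct toy_folio
and the toy_*() helpers are hypothetical names invented for this mail,
and the model deliberately leaves out the reference taken in
folio_add_lru() and the lookup path for a folio that is already in the
page cache. It only models "allocation gives the caller one reference,
the filemap takes one per page, folio_put() drops exactly one".

```c
#include <assert.h>

/* Toy stand-in for struct folio: only what the model needs. */
struct toy_folio {
	int refcount;	/* models the folio refcount */
	int nr_pages;	/* 1 for a base page, >1 for a large folio */
};

/* Models a fresh allocation: the caller starts with one reference. */
static void toy_alloc(struct toy_folio *f, int nr_pages)
{
	f->refcount = 1;
	f->nr_pages = nr_pages;
}

/* Models insertion into the filemap: one reference per page. */
static void toy_add_to_filemap(struct toy_folio *f)
{
	f->refcount += f->nr_pages;
}

/* Models folio_put(): drops exactly one reference.
 * Returns 1 if the count hit zero, i.e. the folio would be freed. */
static int toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	f->refcount--;
	return f->refcount == 0;
}

/* Models one loop iteration of kvm_gmem_allocate() for a folio that
 * did not exist yet. Returns the refcount left after the put. */
static int toy_allocate_one(int nr_pages)
{
	struct toy_folio f;
	int freed;

	toy_alloc(&f, nr_pages);
	toy_add_to_filemap(&f);
	freed = toy_put(&f);	/* drops the caller's allocation reference */
	assert(!freed);		/* the filemap references keep it alive */
	return f.refcount;
}
```

In this model, toy_allocate_one(1) leaves a count of 1 and
toy_allocate_one(512) leaves 512, i.e. only the filemap's references
remain after the put, so the folio is not handed back to the allocator.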