Message-ID: <340596c9-d55d-5f8a-fa27-d95b0e10b20a@intel.com>
Date: Thu, 28 Sep 2023 12:16:31 -0700
Subject: Re: [RFC PATCH v2 0/7] Introduce persistent memory pool
To: Stanislav Kinsburskii
Cc: Baoquan He, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
 dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
 ebiederm@xmission.com, akpm@linux-foundation.org,
 stanislav.kinsburskii@gmail.com, corbet@lwn.net,
 linux-kernel@vger.kernel.org, kexec@lists.infradead.org,
 linux-mm@kvack.org, kys@microsoft.com, jgowans@amazon.com,
 wei.liu@kernel.org, arnd@arndb.de, gregkh@linuxfoundation.org,
 graf@amazon.de, pbonzini@redhat.com, "Shutemov, Kirill"
References: <01828.123092517290700465@us-mta-156.us.mimecast.lan>
 <20230927161319.GA19976@skinsburskii.>
 <20230927232548.GA20221@skinsburskii.>
 <20230928000230.GA20259@skinsburskii.>
 <760bbb08-83b4-7bb1-822f-2ceba26278a6@intel.com>
 <20230928003831.GA20366@skinsburskii.>
From: Dave Hansen
In-Reply-To: <20230928003831.GA20366@skinsburskii.>
Content-Type: text/plain; charset=UTF-8

On 9/27/23 17:38, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 11:00:12AM -0700, Dave Hansen wrote:
>> On 9/27/23 17:02, Stanislav Kinsburskii wrote:
>>> On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
>> ...
>>> Well, not exactly. That's something I'd like to have indeed, but from my
>>> POV this goal is out of scope of discussion at the moment.
>>> Let me try to express it the same way you did above:
>>>
>>> 1. Boot some kernel
>>> 2. Grow the deposited memory a bunch
>>> 3. Kexec
>>> 4. Kernel panic due to GPF upon accessing the memory deposited to
>>>    hypervisor.
>>
>> I basically consider this a bug in the first kernel. It *can't* kexec
>> when it's left RAM in shambles. It doesn't know what features the new
>> kernel has and whether this is even safe.
>>
>
> Could you elaborate more on why this is a bug in the first kernel?
> Say, kernel memory can be allocated in big physically consecutive
> chunks by the first kernel for depositing. The information about these
> chunks is then passed to the second kernel via FDT or even command
> line, so the second kernel can reserve this region during booting.
> What's wrong with this approach?

How do you know the second kernel can parse the FDT entry or the
command-line you pass to it?

>> Can the new kernel even read the new device tree data?
>
> I'm not sure I understand the question, to be honest.
> Why can't it? This series contains code parts for both first and second
> kernels.

How do you know the second kernel isn't the version *before* this series
gets merged?

...
>> I still think the only way this will possibly work when kexec'ing both
>> old and new kernels is to do it with the memory maps that *all* kernels
>> can read.
>
> Could you elaborate more on this?
> The available memory map actually stays the same for both kernels. The
> difference here can be in a different list of memory regions to reserve,
> when the first kernel allocated and deposited another chunk, and thus
> the second kernel needs to reserve this memory as a new region upon
> booting.

Please take a step back from your implementation for a moment. There
are two basic design points that need to be considered.

First, *must* "System RAM" (according to the memory map) be persisted
across kexec? If no, then there's no problem to solve and we can stop
this thread. If yes, then some mechanism must be used to tell the new
kernel that the "System RAM" in the memory map is not normal RAM.

Second, *if* we agree that some data must be communicated across kexec,
then what mechanism should be used?
You're arguing for a new mechanism that only new kernels can use. I'm
arguing that you should likely reuse an existing mechanism (probably the
UEFI/e820 maps) so that *ALL* kernels can consume the information, old
and new.

I'm not convinced that this series is going in the right direction on
either of those points.

> Can all this be considered as, say, the first kernel using the device
> tree to inform the second kernel about the memory regions to reserve?
> In this case the first kernel behaves a bit like a firmware piece for
> the second one.
>
>> Can the hypervisor be improved to make this release operation faster?
>
> I guess it can, but shutting down guests contributes to downtime the
> most. And without shutting down the guests the deposited memory can't be
> withdrawn.

Do you really need to fully shut down each guest? Or do you just need
to get them to a quiescent state where the hypervisor and devices aren't
writing to the deposited memory?
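
To make the memory-map argument above concrete, here is a rough sketch of
what "reuse the existing map" could look like. It is only an illustration
under assumptions, not code from this series or from the kernel: the
`Region` type, the `carve_reserved()` and `usable_ram()` helpers, and the
type strings are all hypothetical stand-ins for an e820-style table. The
point is that the first kernel retypes the deposited ranges before kexec,
so a second kernel that understands nothing but region types still stays
away from them.

```python
from dataclasses import dataclass

# Hypothetical type labels for this sketch; a real e820/UEFI map uses
# numeric type codes rather than strings.
RAM = "System RAM"
RESERVED = "Reserved"


@dataclass(frozen=True)
class Region:
    start: int  # inclusive start address
    end: int    # exclusive end address
    kind: str


def carve_reserved(memory_map, deposited):
    """Return a new map where deposited ranges inside System RAM are Reserved.

    memory_map: list of non-overlapping Region objects.
    deposited:  list of (start, end) address pairs, end exclusive.
    """
    result = []
    for region in memory_map:
        if region.kind != RAM:
            # Non-RAM regions are passed through untouched.
            result.append(region)
            continue
        cursor = region.start
        for dep_start, dep_end in sorted(deposited):
            lo = max(dep_start, cursor)
            hi = min(dep_end, region.end)
            if lo >= hi:
                continue  # no overlap with the remainder of this region
            if cursor < lo:
                result.append(Region(cursor, lo, RAM))
            result.append(Region(lo, hi, RESERVED))
            cursor = hi
        if cursor < region.end:
            result.append(Region(cursor, region.end, RAM))
    return result


def usable_ram(memory_map):
    """Bytes that a kernel looking only at region types would treat as RAM."""
    return sum(r.end - r.start for r in memory_map if r.kind == RAM)


if __name__ == "__main__":
    boot_map = [Region(0x0, 0x1000, RESERVED), Region(0x1000, 0x9000, RAM)]
    new_map = carve_reserved(boot_map, [(0x3000, 0x5000)])
    for r in new_map:
        print(f"{r.start:#08x}-{r.end:#08x} {r.kind}")
```

Nothing in `usable_ram()` knows about deposits, FDT nodes, or command-line
options, which is the property being argued for: the information travels in
a format that old and new kernels already parse.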