Date: Tue, 28 Jan 2025 10:04:01 -0400
From: Jason Gunthorpe
To: Alexander Graf
Cc: Pasha Tatashin, Mike Rapoport, David Rientjes, lsf-pc@lists.linux-foundation.org, "Gowans, James", linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] memory persistence over kexec
Message-ID: <20250128140401.GB1524382@ziepe.ca>
References: <20250120141427.GK674319@ziepe.ca> <20250126200404.GA1103620@ziepe.ca> <54945e03-c437-48b4-b739-4e8ac822c1fc@amazon.com> <20250127131512.GC1103620@ziepe.ca>

On Mon, Jan 27, 2025 at 08:12:37AM -0800, Alexander Graf wrote:
> I agree with the simplifications you're proposing; not using the purgatory
> would be a great property to have.
>
> The reason why KHO doesn't do it yet is that I wanted to keep it simple from
> the other end. The big problem with going A/B is that if done the simple
> way, you only map B as MOVABLE while running in A. That means A could
> accidentally allocate persistent memory from A's memory region.
> When A then switches to B, B can no longer make all of A MOVABLE.

But don't you have this basic problem no matter what? kexec requires a pretty big region of linear memory to boot a kernel into. Even with purgatory and copying, you still have to ensure a free linear region that has no KHO pages in it. That seems impossible to really guarantee unless you have a special KHO allocator that happens to guarantee available linear memory, or you use tricks like the ones we are discussing to make the normal allocator keep allocations out of some linear memory.

> So we need to ensure that *both* regions are MOVABLE, and the system is
> always fully aware of both.

I imagined the kernel would boot with only the A or B area of memory available during early boot, and then in later boot phases it would set up the additional memory that has a mix of KHO and free pages. This feels easier to do once the allocators are all fully started up, i.e. you can deal with KHO pages by just allocating them. [*]

IOW each A/B area should be large enough to complete a lot of boot, and it would end up naturally containing GFP_KERNEL allocations during this process, as it is the only memory available.

If you have a special KHO allocator (GFP_KHO?) then it can simply be aware of this and avoid allocating from the A/B zone. However, it would be much nicer to avoid having to mark possible KHO allocations in code at the allocation point; something like this:

    p = alloc_pages(GFP_KERNEL, order);
    // time passes
    to_kho(p);

So I agree there is an appeal to somehow using the existing allocators to stop taking unmovable pages from the A/B region after some point, so that no to_kho() will ever get a page that is in A/B.

Can you take a ZONE_NORMAL, use it for booting, and then switch it to ZONE_MOVABLE, keeping all the unmovable memory? Something else?
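To make the A/B policy above a little more concrete, here is a minimal userspace C model (not kernel code), assuming a toy page-frame allocator with two fixed scratch areas. During early boot only the active scratch area is usable; once boot completes, allocations are kept out of both A and B, so a page handed to a to_kho()-style call can never sit where the next kernel boots. All names here (toy_alloc, toy_boot_complete, toy_to_kho_pfn, the region layout) are hypothetical, and to_kho() itself is only a proposal in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy physical memory: NPAGES page frames, identified by PFN. */
#define NPAGES 64

/* Hypothetical layout: two scratch areas the kernels alternate between. */
struct region { size_t start, end; };          /* [start, end) in PFNs */
static const struct region scratch[2] = { { 0, 8 }, { 8, 16 } };

static bool used[NPAGES];      /* page frame allocated? */
static bool preserved[NPAGES]; /* handed over via toy_to_kho_pfn()? */
static int active = 0;         /* scratch area this kernel booted in (A = 0) */
static bool boot_done = false; /* false: only the active scratch area exists */

static bool in_region(size_t pfn, const struct region *r)
{
    return pfn >= r->start && pfn < r->end;
}

static bool in_any_scratch(size_t pfn)
{
    return in_region(pfn, &scratch[0]) || in_region(pfn, &scratch[1]);
}

/* Later boot phase: the rest of memory becomes available. */
static void toy_boot_complete(void)
{
    boot_done = true;
}

/* Allocate one page frame; returns its PFN, or -1 if nothing fits. */
static long toy_alloc(void)
{
    for (size_t pfn = 0; pfn < NPAGES; pfn++) {
        if (used[pfn])
            continue;
        /* Early boot: only the active scratch area is visible. */
        if (!boot_done && !in_region(pfn, &scratch[active]))
            continue;
        /* After boot: keep (unmovable) allocations out of both A and B. */
        if (boot_done && in_any_scratch(pfn))
            continue;
        used[pfn] = true;
        return (long)pfn;
    }
    return -1;
}

/* Mark an already-allocated page for preservation across kexec. */
static void toy_to_kho_pfn(long pfn)
{
    assert(pfn >= 0 && pfn < NPAGES && used[pfn]);
    /* The goal from the thread: to_kho() never gets a page in A/B. */
    assert(!in_any_scratch((size_t)pfn));
    preserved[pfn] = true;
}

/* True if the area the next kernel boots into holds no preserved page. */
static bool next_scratch_is_free(void)
{
    const struct region *next = &scratch[!active];

    for (size_t pfn = next->start; pfn < next->end; pfn++)
        if (preserved[pfn])
            return false;
    return true;
}
```

In this model an early-boot allocation lands in scratch area A, while anything allocated after toy_boot_complete() comes from outside both areas, which is what keeps next_scratch_is_free() true. The trade-off the thread raises is visible too: a page allocated during early boot cannot be preserved without something like a dedicated GFP_KHO path.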
[*] For drivers I'm imagining that we can do:

    p = alloc_pages(GFP_KERNEL | GFP_KHO | __GFP_COMP, order);
    to_kho(p);
    // kexec
    from_kho(p);
    folio_put(page_folio(p));

Meaning KHO has to preserve the folio, keep the KVA the same, manage the refcount, and restore the compound page state set up by __GFP_COMP. I think if you have this as the basic primitive, you can build everything else on top of it.

Jason
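The to_kho()/from_kho() primitive in the footnote can be sketched as a small userspace C model, assuming the old kernel records each preserved folio's KVA, compound order, and refcount in a handover table that survives kexec (here simply a static array). struct toy_folio, struct kho_record, toy_to_kho(), toy_from_kho(), and toy_folio_put() are hypothetical stand-ins, not the kernel's struct folio or any existing KHO API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a compound page / folio (hypothetical, not struct folio). */
struct toy_folio {
    uintptr_t kva;   /* kernel virtual address; must be identical after kexec */
    unsigned order;  /* compound order, as established by __GFP_COMP */
    int refcount;
};

/* What the old kernel hands to the new one for each preserved folio. */
struct kho_record {
    uintptr_t kva;
    unsigned order;
    int refcount;
};

#define MAX_RECORDS 16
static struct kho_record handover[MAX_RECORDS]; /* models the kexec handover */
static size_t nr_records;

/* Old kernel: remember everything needed to rebuild the folio. */
static void toy_to_kho(const struct toy_folio *f)
{
    assert(nr_records < MAX_RECORDS);
    handover[nr_records++] =
        (struct kho_record){ f->kva, f->order, f->refcount };
}

/* New kernel: rebuild the folio at the same KVA, order and refcount.
 * Returns 0 on success, -1 if that KVA was never preserved. */
static int toy_from_kho(uintptr_t kva, struct toy_folio *out)
{
    for (size_t i = 0; i < nr_records; i++) {
        if (handover[i].kva == kva) {
            out->kva = handover[i].kva;
            out->order = handover[i].order;
            out->refcount = handover[i].refcount;
            return 0;
        }
    }
    return -1;
}

/* Drop a reference; returns 1 when the last one is gone. */
static int toy_folio_put(struct toy_folio *f)
{
    assert(f->refcount > 0);
    return --f->refcount == 0;
}
```

Keying the lookup on the KVA mirrors the footnote's from_kho(p): because the KVA is kept the same across kexec, the driver can use the same pointer before and after, and the final put behaves like any other folio release.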