Message-ID: <25451d4f-978e-8106-3ee6-e9b382bb87a3@redhat.com>
Date: Mon, 3 Apr 2023 10:28:57 +0200
Subject: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
To: Kyungsan Kim
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-cxl@vger.kernel.org, a.manzanares@samsung.com, viacheslav.dubeyko@bytedance.com, dan.j.williams@intel.com, seungjun.ha@samsung.com, wj28.lee@samsung.com
References: <7c7933df-43da-24e3-2144-0551cde05dcd@redhat.com> <20230331114220.400297-1-ks0204.kim@samsung.com>
From: David Hildenbrand
Organization: Red Hat
In-Reply-To: <20230331114220.400297-1-ks0204.kim@samsung.com>

On 31.03.23 13:42, Kyungsan Kim wrote:
>> On 24.03.23 14:08, Jørgen Hansen wrote:
>>>
>>>> On 24 Mar 2023, at 10.50, Kyungsan Kim wrote:
>>>>
>>>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>>>> Thank you David Hildenbrand for your interest on this topic.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>>>> [..]
>>>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>>>> memory in GPU card, etc. I guess that we need to consider them
>>>>>>>>>>>>> together. Do we need to add one zone type for each kind of memory?
>>>>>>>>>>>>
>>>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>>>> device. Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>>>> types. Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>>>> characteristics.
>>>>>>>>>>>
>>>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>>>> maintainability concern of "fewer degrees of freedom in the ZONE
>>>>>>>>>>> dimension" starts to dominate.
>>>>>>>>>>
>>>>>>>>>> One problem we experienced occurred in the combination of hot-remove and kernelspace allocation usecases.
>>>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>>
>>>>>>>>> That sounds like a bad hack :) .
>>>>>>>> I consent you.
>>>>>>>>
>>>>>>>>>> In case, oops and system hang has occasionally occurred because ZONE_MOVABLE can be swapped.
>>>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing selective choice of the two usecases.
>>>>>>>>
>>>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>>>> frowned upon.
>>>>>>>>
>>>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>>>>
>>>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>>>>
>>>>>>> So how is it different that it would justify a different (more confusing
>>>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>>>> interested in which other aspect that zone would be "special".
>>>>>>
>>>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>>>> So I changed it as ZONE_EXMEM.
>>>>>> We also would like to point out a "special" zone aspect, which is different from ZONE_NORMAL for traditional DDR DRAM.
>>>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>>>> Do you prefer ZONE_SPECIAL? :)
>>>>>
>>>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>>>> be a good reason to name it differently?
>>>>>
>>>>
>>>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW differences of extended memory devices.
>>>> Although the ZONE_EXMEM considers the movable and fragmentation aspect, it is not all what ZONE_EXMEM considers.
>>>> So it is named as it.
>>>
>>> Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.
>>
>> Yes. I still did not get a satisfying answer to my original question:
>> what would be the differences between both zones from a MM point of
>> view? We can discuss that in the session, of course.
>>
>> Regarding performance differences, I thought the idea was to go with
>> different nodes to express (and model) such.
>>
>
> From a MM point of view on the movability aspect, a kernel context is not allocated from ZONE_EXMEM without using GFP_EXMEM explicitly.
> In contrast, if we understand the design of ZONE_PREFER_MOVABLE correctly, a kernel context can be allocated from ZONE_PREFER_MOVABLE implicitly as the fallback of ZONE_NORMAL allocation.
> However, the movable attribute is not all we are concerning.
> In addition, we experienced page allocation and migration issue on the heterogeneous memories.
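
The explicit-versus-implicit distinction described above can be illustrated
with a toy model. This is only a sketch in Python, not kernel code: the
Zone class, the allocate() helper and both zonelist functions are
hypothetical names invented for illustration, and the assumption that a
GFP_EXMEM request targets ZONE_EXMEM alone (with no further fallback) is
mine, not something stated in the thread.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free_pages: int


def allocate(zonelist, pages=1):
    """Take pages from the first zone in the list with enough room.

    Returns the name of the zone that satisfied the request, or None
    if every zone in the list is exhausted.
    """
    for zone in zonelist:
        if zone.free_pages >= pages:
            zone.free_pages -= pages
            return zone.name
    return None


def kernel_zonelist_exmem(zones, gfp_exmem=False):
    # Explicit model: a kernel allocation reaches ZONE_EXMEM only when the
    # caller opts in with the (proposed) GFP_EXMEM flag.
    # Assumption: no fallback to other zones in either case.
    return [zones["EXMEM"]] if gfp_exmem else [zones["NORMAL"]]


def kernel_zonelist_prefer_movable(zones):
    # Implicit model: ZONE_PREFER_MOVABLE sits behind ZONE_NORMAL as a
    # fallback, so a kernel allocation can land there without asking.
    return [zones["NORMAL"], zones["PREFER_MOVABLE"]]


if __name__ == "__main__":
    # ZONE_NORMAL is exhausted; the other zones still have free pages.
    zones = {
        "NORMAL": Zone("NORMAL", 0),
        "EXMEM": Zone("EXMEM", 4),
        "PREFER_MOVABLE": Zone("PREFER_MOVABLE", 4),
    }
    print(allocate(kernel_zonelist_exmem(zones)))
    print(allocate(kernel_zonelist_exmem(zones, gfp_exmem=True)))
    print(allocate(kernel_zonelist_prefer_movable(zones)))
```

With ZONE_NORMAL exhausted, the first request fails, the second is served
from EXMEM because the caller asked for it, and the third silently lands in
PREFER_MOVABLE. That silent placement is the "implicit" behaviour being
contrasted with GFP_EXMEM.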
>
> Given our experiences/design and industry's viewpoints/inquiries,
> I will prepare a few slides in the session to explain
> 1. Usecase - user/kernelspace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
> 2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intended/unintended)
> 3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)

Yes, especially a motivation for GFP_EXMEM and ZONE_EXMEM would be great.
New GFP flags and zones are very likely to get a lot of upstream pushback.
So we need a clear motivation and discussion of alternatives (and why this
memory has to be treated so specially but still wants to be managed by the
buddy).

Willy raises some very good points.

-- 
Thanks,

David / dhildenb