Date: Wed, 17 May 2023 10:09:31 +0200
From: David Hildenbrand
Organization: Red Hat
To: "Huang, Ying"
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Arjan Van De Ven, Andrew Morton, Mel Gorman, Vlastimil Babka,
 Johannes Weiner, Dave Hansen, Pavel Tatashin, Matthew Wilcox
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones
In-Reply-To: <87bkij7ncn.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20230511065607.37407-1-ying.huang@intel.com>
 <87r0rm8die.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzx87h1d.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <3d77ca46-6256-7996-b0f5-67c414d2a8dc@redhat.com>
 <87bkij7ncn.fsf@yhuang6-desk2.ccr.corp.intel.com>

>> If we could avoid instantiating more zones and rather improve existing
>> mechanisms (PCP), that would be much preferred IMHO. I'm sure it's not
>> easy, but that shouldn't stop us from trying ;)
>
> I do think improving PCP or adding another level of cache will help
> performance and scalability.
>
> And I think it also has value to improve the performance of the zone
> itself, because there will always be some cases where the zone lock
> itself is contended.
>
> That is, PCP and the zone work at different levels, and both deserve
> to be improved. Do you agree?

Spoiler: my humble opinion

Well, the zone is kind-of your "global" memory provider, and PCPs cache
a fraction of that to avoid exactly having to mess with that global
data structure and its lock contention.
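
To make that a bit more concrete, here is a toy userspace model of that
relationship (deliberately oversimplified -- the real thing lives in
include/linux/mmzone.h and mm/page_alloc.c and looks nothing like this):
a single global pool behind a single lock, with a per-CPU cache in front
of it that only takes that lock to move whole batches of pages.

/*
 * Toy model only, NOT kernel code: one "zone" as the global memory
 * provider behind one lock, plus a per-CPU ("PCP"-like) cache so the
 * fast path never touches the global lock.
 */
#include <pthread.h>
#include <stdio.h>

#define PCP_BATCH 32          /* pages moved per global-lock acquisition */

struct zone {                 /* the "global" memory provider */
    pthread_mutex_t lock;     /* the contended lock we are talking about */
    long free_pages;          /* single global view: watermarks, pressure */
};

struct pcp_cache {            /* per-CPU cache sitting in front of it */
    long count;               /* pages currently cached on this CPU */
};

static struct zone z = { PTHREAD_MUTEX_INITIALIZER, 1 << 20 };
static _Thread_local struct pcp_cache pcp;  /* stand-in for a per-CPU var */

static int alloc_page(void)
{
    if (pcp.count == 0) {
        /* Slow path: refill a whole batch under the zone lock. */
        pthread_mutex_lock(&z.lock);
        long grab = z.free_pages < PCP_BATCH ? z.free_pages : PCP_BATCH;
        z.free_pages -= grab;
        pthread_mutex_unlock(&z.lock);
        pcp.count += grab;
        if (!grab)
            return -1;        /* pool empty: pressure is visible globally */
    }
    /* Fast path: served from the per-CPU cache, no global lock taken. */
    pcp.count--;
    return 0;
}

int main(void)
{
    for (int i = 0; i < 100; i++)
        alloc_page();
    printf("cached on this CPU: %ld, left in the zone: %ld\n",
           pcp.count, z.free_pages);
    return 0;
}

The fast path never takes the global lock, yet "how much memory is left"
remains a single number for the whole pool -- which is exactly the
nicely integrated property I get to below.
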
One benefit I can see of such a "global" memory provider with caches on
top is that it is nicely integrated: for example, the concept of memory
pressure exists for the zone as a whole. All memory is of the same kind
and managed in a single entity, but free memory is cached for
performance.

As soon as you manage the memory in multiple zones of the same kind, you
lose that "global" view of your memory that is of the same kind but
managed in different buckets. You might end up with a lot of memory
pressure in a single such zone, but still have plenty in another zone.

As one example, hot(un)plug of memory is easy: there is only a single
zone. No need to make smart decisions or deal with having memory we're
hotunplugging be stranded in multiple zones.

>
>> I did not look into the details of this proposal, but seeing the
>> change in include/linux/page-flags-layout.h scares me.
>
> It's possible for us to use 1 more bit in page->flags. Do you think
> that will cause a severe issue? Or do you think some other stuff isn't
> acceptable?

The issue is, everybody wants to consume more bits in page->flags, so if
we can get away without it, that would be much better :)

The more bits you want to consume, the more people will ask for making
this a compile-time option and eventually compile it out on distro
kernels (e.g., with many NUMA nodes). So we end up with more code and
complexity and eventually not get the benefits where we really want
them.

>
>> Further, I'm not so sure how that change really interacts with
>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>> hacks the code such that the split works based on the boot memory
>> size ...
>
> Em..., the zone stuff is kind of static now. It's hard to add a zone
> at run-time. So, in this series, we determine the number of zones per
> zone type based on the boot memory size. This may be improved in the
> future by pre-allocating some empty zone instances during boot and
> hot-adding some memory to these zones.

Just to give you some idea: with virtio-mem, Hyper-V, daxctl, and
upcoming CXL dynamic memory pooling (some day, I'm sure ;) ) you might
see quite a small boot memory (e.g., 4 GiB) but a significant amount of
memory getting hotplugged incrementally (e.g., up to 1 TiB) -- well, and
hotunplugged.

With multiple zone instances you really have to be careful and might
have to re-balance between the multiple zones to keep the scalability
and avoid imbalances between the zones ...

Something like PCP auto-tuning would be able to handle that mostly
automatically, as there is only a single memory pool.

>
>> I agree with Michal that looking into auto-tuning PCP would be
>> preferred. If that can't be done, adding another layer might end up
>> cleaner and eventually cover more use cases.
>
> I do agree that it's valuable to make PCP etc. cover more use cases. I
> just think that this should not prevent us from optimizing the zone
> itself to cover the remaining use cases.

I really don't like the concept of replicating zones of the same kind
for the same NUMA node. But that's just my personal opinion as someone
maintaining some memory hot(un)plug code :)

That said, some kind of sub-zone concept (an additional layer), as
outlined by Michal IIUC, for example indexed by core id/hash/whatsoever,
could eventually be worth exploring. Yes, such a design raises various
questions ... :)

--
Thanks,

David / dhildenb