From: Dan Williams <dan.j.williams@intel.com>
Date: Thu, 28 Mar 2019 14:32:42 -0700
Subject: Re: [PATCH v5 00/10] mm: Sub-section memory hotplug support
To: David Hildenbrand
Cc: Andrew Morton, Jérôme Glisse, Logan Gunthorpe, Toshi Kani, Jeff Moyer,
    Michal Hocko, Vlastimil Babka, stable, Linux MM, linux-nvdimm,
    Linux Kernel Mailing List

On Thu, Mar 28, 2019 at 2:17 PM David Hildenbrand wrote:
>
> >> You are using the term "Sub-section memory hotplug support", but is it
> >> actually what you mean? To rephrase, aren't we talking here about
> >> "Sub-section device memory hotplug support" or similar?
> >
> > Specifically it is support for passing @start and @size arguments to
> > arch_add_memory() that are not section aligned. It's not limited to
> > "device memory" which is otherwise not a concept that
> > arch_add_memory() understands, it just groks spans of pfns.
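
[ Aside, purely for illustration: a rough sketch (not kernel code) of what
  "not section aligned" means in practice, using the x86-64 numbers from
  this series (128MB sections, 2MB sub-sections = SECTION_SIZE /
  BITS_PER_LONG) and a made-up namespace range: ]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SECTION_SIZE    (128ULL << 20)      /* 128MB Linux memory section */
#define SUBSECTION_SIZE (SECTION_SIZE / 64) /* 2MB, one bit per sub-section */

/* true if both the start and the size are multiples of 'unit' */
static bool range_aligned(uint64_t start, uint64_t size, uint64_t unit)
{
        return !(start & (unit - 1)) && !(size & (unit - 1));
}

int main(void)
{
        /* hypothetical PMEM namespace: 2MB aligned, but not 128MB aligned */
        uint64_t start = 0x202000000ULL;    /* 8GB + 32MB */
        uint64_t size  = 48ULL << 20;       /* 48MB */

        printf("section aligned:     %d\n", range_aligned(start, size, SECTION_SIZE));
        printf("sub-section aligned: %d\n", range_aligned(start, size, SUBSECTION_SIZE));
        return 0;
}

[ add_memory() and the memory block device path keep requiring the first
  kind of alignment; this series only relaxes arch_add_memory() to the
  second. ]
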
>
> Okay, so everything that does not have a memory block device as of now.
>
> >
> >> Reason I am asking is because I wonder how that would interact with the
> >> memory block device infrastructure and hotplugging of system ram -
> >> add_memory()/add_memory_resource(). I *assume* you are not changing the
> >> add_memory() interface, so that one still only works with whole sections
> >> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> >
> > Like you found below, the implementation enforces that add_memory_*()
> > interfaces maintain section alignment for @start and @size.
> >
> >> In general, mix and matching system RAM and persistent memory per
> >> section, I am not a friend of that.
> >
> > You have no choice. The platform may decide to map PMEM and System RAM
> > in the same section because the Linux section is too large compared to
> > typical memory controller mapping granularity capability.
>
> I might be very wrong here, but do we actually care about something like
> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
> of the couple of MB of system ram and handle the PMEM part only. Treat
> the system ram parts like memory holes we already have in ordinary
> sections (well, there we simply set the relevant struct pages to
> PG_reserved). Of course, if we have hundreds of unaligned devices and
> stuff will start to add up ... but I assume this is not the case?

That's precisely what we do today and it has become untenable as the
collision scenarios pile up. This thread [1] is worth a read if you care
about some of the gory details of why I'm back to pushing for sub-section
support, but most of it has already been summarized in the current
discussion on this thread.

[1]: https://lore.kernel.org/lkml/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com/

>
> >
> >> Especially when it comes to memory
> >> block devices. But I am getting the feeling that we are rather targeting
> >> PMEM vs. PMEM with this patch series.
> >
> > The collisions are between System RAM, PMEM regions, and PMEM
> > namespaces (sub-divisions of regions that each need their own mapping
> > lifetime).
>
> Understood. I wonder if that PMEM-only mapping (including separate
> lifetime) could be handled differently. But I am absolutely no expert,
> just curious.

I refer you to the above thread trying to fix the libnvdimm-local hacks.

>
> >
> >>> Quote patch7:
> >>>
> >>> "The libnvdimm sub-system has suffered a series of hacks and broken
> >>> workarounds for the memory-hotplug implementation's awkward
> >>> section-aligned (128MB) granularity. For example the following backtrace
> >>> is emitted when attempting arch_add_memory() with physical address
> >>> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> >>> within a given section:
> >>>
> >>>  WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >>>  devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >>>  [..]
> >>>  Call Trace:
> >>>   dump_stack+0x86/0xc3
> >>>   __warn+0xcb/0xf0
> >>>   warn_slowpath_fmt+0x5f/0x80
> >>>   devm_memremap_pages+0x3b5/0x4c0
> >>>   __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >>>   pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >>>
> >>> Recently it was discovered that the problem goes beyond RAM vs PMEM
> >>> collisions as some platforms produce PMEM vs PMEM collisions within a
> >>
> >> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> >> implemented "on top" of what we have right now. Or is this what we
> >> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
> >> sorry for the stupid questions)
> >
> > It doesn't work, because even if the padding was implemented 100%
> > correct, which thus far has failed to be the case, the platform may
> > change physical alignments from one boot to the next for a variety of
> > reasons.
>
> Would ignoring the System RAM parts (as mentioned above) help or doesn't
> it make any difference in terms of complexity?

Doesn't help much, that's only one of many collision sources.

> >>> given section. The libnvdimm workaround for that case revealed that the
> >>> libnvdimm section-alignment-padding implementation has been broken for a
> >>> long while. A fix for that long-standing breakage introduces as many
> >>> problems as it solves as it would require a backward-incompatible change
> >>> to the namespace metadata interpretation. Instead of that dubious route
> >>> [2], address the root problem in the memory-hotplug implementation."
> >>>
> >>> The approach taken is to observe that each section already maintains
> >>> an array of 'unsigned long' values to hold the pageblock_flags. A single
> >>> additional 'unsigned long' is added to house a 'sub-section active'
> >>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> >>> capacity, which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >>>
> >>> The implication of allowing sections to be piecemeal mapped/unmapped is
> >>> that the valid_section() helper is no longer authoritative to determine
> >>> if a section is fully mapped. Instead pfn_valid() is updated to consult
> >>> the section-active bitmask. Given that typical memory hotplug still has
> >>> deep "section" dependencies, the sub-section capability is limited to
> >>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> >>> devm_memremap_pages() users for now.
> >>
> >> Ah, there it is. And my point would be, please don't ever unlock
> >> something like that for want_memblock=true. Especially not for memory
> >> added after boot via device drivers (add_memory()).
> >
> > I don't see a strong reason why not, as long as it does not regress
> > existing use cases. It might need to be an opt-in for new tooling that
> > is aware of finer granularity hotplug. That said, I have no pressing
> > need to go there and just care about the arch_add_memory() capability
> > for now.
>
> Especially onlining/offlining of memory might end up very ugly. And that
> goes hand in hand with memory block devices. They are either online or
> offline, not something in between. (I went that path and Michal
> correctly told me why it is not a good idea)

Thread reference?

> I was recently trying to teach memory block devices who their owner is /
> of which type they are. Right now I am looking into the option of using
> drivers. Memory block devices that could belong to different drivers at
> a time are well ... totally broken.

Sub-section support is aimed at a similar case where different portions
of a 128MB span need to be handed out to devices / drivers with
independent lifetimes.

> I assume it would still be a special
> case, though, but conceptually speaking about the interface it would be
> allowed.
>
> Memory block devices (and therefore 1..X sections) should have one owner
> only. Anything else just does not fit.

Yes, but I would say the problem there is that the memory-block-devices
interface design is showing its age and is under pressure from how systems
want to deploy and use memory today.
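
For anyone who wants the 'sub-section active' bitmask from the quoted cover
letter spelled out, here is a rough model of the idea (illustrative C only;
the struct and function names below are made up, this is not the patch code):

#include <stdbool.h>

#define PAGES_PER_SECTION    (1UL << 15) /* 128MB of 4KB pages on x86-64 */
#define SUBSECTIONS_PER_SECT 64UL        /* BITS_PER_LONG */
#define PAGES_PER_SUBSECTION (PAGES_PER_SECTION / SUBSECTIONS_PER_SECT) /* 512 */

struct toy_mem_section {
        unsigned long subsection_map; /* the single extra 'unsigned long' */
        /* ... pageblock_flags, mem_map, etc. elided ... */
};

static unsigned long subsection_bit(unsigned long pfn)
{
        /* which 2MB sub-section of its 128MB section does this pfn fall in? */
        return 1UL << ((pfn & (PAGES_PER_SECTION - 1)) / PAGES_PER_SUBSECTION);
}

/* arch_add_memory()-time: mark the sub-sections backing [pfn, pfn + nr) */
static void toy_subsection_activate(struct toy_mem_section *ms,
                                    unsigned long pfn, unsigned long nr)
{
        unsigned long end = pfn + nr; /* assumes a sub-section aligned span */

        for (; pfn < end; pfn += PAGES_PER_SUBSECTION)
                ms->subsection_map |= subsection_bit(pfn);
}

/* pfn_valid()-time: a partially populated section answers per sub-section */
static bool toy_pfn_valid(struct toy_mem_section *ms, unsigned long pfn)
{
        return ms->subsection_map & subsection_bit(pfn);
}

Per the cover letter, the real patch keeps this bookkeeping alongside the
existing pageblock_flags in the section, and only want_memblock=false /
devm_memremap_pages() callers of arch_add_memory() populate sections
piecemeal this way.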