Subject: Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)
From: David Hildenbrand
To: Dan Williams
Cc: Linux Kernel Mailing List, Linux MM, Michal Hocko, Andrew Morton,
    kvm-ppc@vger.kernel.org, linuxppc-dev, KVM list,
    linux-hyperv@vger.kernel.org, devel@driverdev.osuosl.org, xen-devel,
    X86 ML, Alexander Duyck, Kees Cook, Alex Williamson, Allison Randal,
    Andy Lutomirski, "Aneesh Kumar K.V", Anshuman Khandual, Anthony Yznaga,
    Ben Chan, Benjamin Herrenschmidt, Borislav Petkov, Boris Ostrovsky,
    Christophe Leroy, Cornelia Huck, Dan Carpenter, Dave Hansen,
    Fabio Estevam, Greg Kroah-Hartman, Haiyang Zhang, "H. Peter Anvin",
    Ingo Molnar, "Isaac J. Manjarres", Jeremy Sowden, Jim Mattson,
    Joerg Roedel, Johannes Weiner, Juergen Gross, KarimAllah Ahmed,
    Kate Stewart, "K. Y. Srinivasan", Madhumitha Prabakaran, Matt Sickler,
    Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport,
    Mike Rapoport, Nicholas Piggin, Nishka Dasgupta, Oscar Salvador,
    Paolo Bonzini, Paul Mackerras, Paul Mackerras, Pavel Tatashin,
    Pavel Tatashin, Peter Zijlstra, Qian Cai, Radim Krčmář, Rob Springer,
    Sasha Levin, Sean Christopherson, Simon Sandström, Stefano Stabellini,
    Stephen Hemminger, Thomas Gleixner, Todd Poynor, Vandana BN,
    Vitaly Kuznetsov, Vlastimil Babka, Wanpeng Li, YueHaibing
References: <20191022171239.21487-1-david@redhat.com>
Organization: Red Hat GmbH
Date: Thu, 24 Oct 2019 14:50:38 +0200

On 23.10.19 09:26, David Hildenbrand wrote:
> On 22.10.19 23:54, Dan Williams wrote:
>> Hi David,
>>
>> Thanks for tackling this!
>
> Thanks for having a look :)
>
> [...]
>
>
>>> I am probably a little bit too careful (but I don't want to break things).
>>> In most places (besides KVM and vfio that are nuts), the
>>> pfn_to_online_page() check could most probably be avoided by a
>>> is_zone_device_page() check.
>>> However, I usually get suspicious when I see
>>> a pfn_valid() check (especially after I learned that people mmap parts of
>>> /dev/mem into user space, including memory without memmaps. Also, people
>>> could memmap offline memory blocks this way :/). As long as this does not
>>> hurt performance, I think we should rather do it the clean way.
>>
>> I'm concerned about using is_zone_device_page() in places that are not
>> known to already have a reference to the page. Here's an audit of
>> current usages, and the ones I think need to be cleaned up. The "unsafe"
>> ones do not appear to have any protections against the device page
>> being removed (get_dev_pagemap()). Yes, some of these were added by
>> me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
>> pages into anonymous memory paths and I'm not up to speed on how it
>> guarantees 'struct page' validity vs device shutdown without using
>> get_dev_pagemap().
>>
>> smaps_pmd_entry(): unsafe
>>
>> put_devmap_managed_page(): safe, page reference is held
>>
>> is_device_private_page(): safe? gpu driver manages private page lifetime
>>
>> is_pci_p2pdma_page(): safe, page reference is held
>>
>> uncharge_page(): unsafe? HMM
>>
>> add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()
>>
>> soft_offline_page(): unsafe
>>
>> remove_migration_pte(): unsafe? HMM
>>
>> move_to_new_page(): unsafe? HMM
>>
>> migrate_vma_pages() and helpers: unsafe? HMM
>>
>> try_to_unmap_one(): unsafe? HMM
>>
>> __put_page(): safe
>>
>> release_pages(): safe
>>
>> I'm hoping all the HMM ones can be converted to
>> is_device_private_page() directly and have that routine grow a nice
>> comment about how it knows it can always safely de-reference its @page
>> argument.
>>
>> For the rest I'd like to propose that we add a facility to determine
>> ZONE_DEVICE by pfn rather than page.
>> The most straightforward way I
>> can think of would be to just add another bitmap to mem_section_usage
>> to indicate if a subsection is ZONE_DEVICE or not.
>
> (it's a somewhat unrelated bigger discussion, but we can start discussing it in this thread)
>
> I dislike this for three reasons
>
> a) It does not protect against any races, really, it does not improve things.
> b) We do have the exact same problem with pfn_to_online_page(). As long as we
>    don't hold the memory hotplug lock, memory can get offlined and removed at any time. Racy.
> c) We mix in ZONE specific stuff into the core. It should be "just another zone"
>
> What I propose instead (already discussed in https://lkml.org/lkml/2019/10/10/87)
>
> 1. Convert SECTION_IS_ONLINE to SECTION_IS_ACTIVE
> 2. Convert SECTION_IS_ACTIVE to a subsection bitmap
> 3. Introduce pfn_active() that checks against the subsection bitmap
> 4. Once the memmap was initialized / prepared, set the subsection active
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 5. Before the memmap gets invalidated, set the subsection inactive
>    (similar to SECTION_IS_ONLINE in the buddy right now)
> 6. pfn_to_online_page() = pfn_active() && zone != ZONE_DEVICE
> 7. pfn_to_device_page() = pfn_active() && zone == ZONE_DEVICE
>

Dan, I suspect that you want a pfn_to_zone() that will not touch
the memmap, because it could potentially (altmap) lie on slow memory, right?

A modification might make this possible (but I am not yet sure if we
want a less generic MM implementation just to fine tune slow memmap
access here)

1. Keep SECTION_IS_ONLINE as it is with the same semantics
2. Introduce a subsection bitmap to record active ("initialized memmap")
   PFNs. E.g., also set it when setting sections online.
3. Introduce pfn_active() that checks against the subsection bitmap
4. Once the memmap was initialized / prepared, set the subsection active
   (similar to SECTION_IS_ONLINE in the buddy right now)
5. Before the memmap gets invalidated, set the subsection inactive
   (similar to SECTION_IS_ONLINE in the buddy right now)
6. pfn_to_online_page() = pfn_active() && section == SECTION_IS_ONLINE
   (or keep it as is, depends on the RCU locking we eventually implement)
7. pfn_to_device_page() = pfn_active() && section != SECTION_IS_ONLINE
8. use pfn_active() whenever we don't care about the zone.

Again, I'm not really a fan of that: it hardcodes ZONE_DEVICE vs.
!ZONE_DEVICE. When we do a random "pfn_to_page()" (e.g., a pfn walker)
we really want to touch the memmap right away either way. So we can also
directly read the zone from it. Right now, I really do prefer a more
generic implementation.

-- 
Thanks,

David / dhildenb
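
Illustration only: the modified proposal above (keep SECTION_IS_ONLINE,
add a per-subsection "active" bitmap, derive the online/device checks from
the two) can be modelled in plain userspace C. This is a rough sketch under
stated assumptions, not kernel code and not an actual patch. Every toy_*
name, the struct layout, and the section/subsection sizes are hypothetical,
and the checks return booleans instead of a struct page pointer. The
hotplug locking / RCU question raised in the thread is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy geometry, chosen for readability; these are NOT the kernel's values. */
#define TOY_PFNS_PER_SUBSECTION     4UL
#define TOY_SUBSECTIONS_PER_SECTION 4UL
#define TOY_PFNS_PER_SECTION \
	(TOY_PFNS_PER_SUBSECTION * TOY_SUBSECTIONS_PER_SECTION)
#define TOY_NR_SECTIONS             4UL

/* Hypothetical stand-in for per-section state. */
struct toy_section {
	bool online;              /* models the unchanged SECTION_IS_ONLINE flag */
	unsigned long active_map; /* one bit per subsection: memmap initialized */
};

/* Static storage, so all sections start offline and inactive. */
static struct toy_section toy_sections[TOY_NR_SECTIONS];

static struct toy_section *toy_pfn_to_section(unsigned long pfn)
{
	assert(pfn / TOY_PFNS_PER_SECTION < TOY_NR_SECTIONS);
	return &toy_sections[pfn / TOY_PFNS_PER_SECTION];
}

static unsigned long toy_subsection_bit(unsigned long pfn)
{
	return 1UL << ((pfn % TOY_PFNS_PER_SECTION) / TOY_PFNS_PER_SUBSECTION);
}

/* Called once the memmap of the subsection has been initialized. */
void toy_set_subsection_active(unsigned long pfn)
{
	toy_pfn_to_section(pfn)->active_map |= toy_subsection_bit(pfn);
}

/* Called before the memmap of the subsection gets invalidated. */
void toy_set_subsection_inactive(unsigned long pfn)
{
	toy_pfn_to_section(pfn)->active_map &= ~toy_subsection_bit(pfn);
}

/* Onlining a section also marks all of its subsections active. */
void toy_online_section(unsigned long pfn)
{
	struct toy_section *s = toy_pfn_to_section(pfn);

	s->online = true;
	s->active_map = (1UL << TOY_SUBSECTIONS_PER_SECTION) - 1;
}

/* Models pfn_active(): checks only the subsection bitmap. */
bool toy_pfn_active(unsigned long pfn)
{
	return toy_pfn_to_section(pfn)->active_map & toy_subsection_bit(pfn);
}

/* Models pfn_to_online_page(): active and the section is online. */
bool toy_pfn_is_online(unsigned long pfn)
{
	return toy_pfn_active(pfn) && toy_pfn_to_section(pfn)->online;
}

/* Models pfn_to_device_page(): active but the section is not online. */
bool toy_pfn_is_device(unsigned long pfn)
{
	return toy_pfn_active(pfn) && !toy_pfn_to_section(pfn)->online;
}
```

In this model, a subsection that was activated without onlining its section
is classified as a device range without ever looking at a memmap, which is
the property the modification is after. It also makes the objection visible:
"active but not online" is hardcoded to mean ZONE_DEVICE.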