Date: Mon, 31 Jul 2023 21:00:06 +0200
From: David Hildenbrand
Organization: Red Hat
To: Linus Torvalds
Cc: Peter Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Andrew Morton, liubo, Matthew Wilcox,
    Hugh Dickins, Jason Gunthorpe, John Hubbard, Mel Gorman
Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout
References: <20230727212845.135673-1-david@redhat.com>
    <412bb30f-0417-802c-3fc4-a4e9d5891c5d@redhat.com>
    <66e26ad5-982e-fe2a-e4cd-de0e552da0ca@redhat.com>

On 31.07.23 20:23, Linus Torvalds wrote:
> On Mon, 31 Jul 2023 at 09:20, David Hildenbrand wrote:
>>

Hi Linus,

>> I modified it slightly: FOLL_HONOR_NUMA_FAULT is now set in
>> is_valid_gup_args(), such that it will always be set for any GUP users,
>> including GUP-fast.
>
> But do we actually want that?
> It is actively crazy to honor NUMA
> faulting at least for get_user_pages_remote().

This would only be for the stable backport that would go in first and
where I want to be a bit careful.

The next step would be to let the callers (KVM) specify
FOLL_HONOR_NUMA_FAULT, as suggested by you.

>
> So right now, GUP-fast requires us to honor NUMA faults, because
> GUP-fast doesn't have a vma (which in turn is because GUP-fast doesn't
> take any locks).

With FOLL_HONOR_NUMA_FAULT moved to the GUP caller, that would no longer
be the case.

Anybody who
(1) doesn't specify FOLL_HONOR_NUMA_FAULT, which is the majority
(2) doesn't specify FOLL_WRITE

would get GUP-fast just grabbing these pte_protnone() pages.

>
> So GUP-fast can only look at the page table data, and as such *has* to
> fail if the page table is inaccessible.

gup_fast_only, yes, which is what KVM uses if a writable PFN is desired.

>
> But GUP in general? Why would it want to honor numa faulting?
> Particularly by default, and _particularly_ for things like
> FOLL_REMOTE.

KVM currently does [virt/kvm/kvm_main.c]:

(1) hva_to_pfn_fast(): call get_user_page_fast_only(FOLL_WRITE) if a
    writable PFN is desired
(2) hva_to_pfn_slow(): call get_user_pages_unlocked()

So in the "!writable" case, we would always call
get_user_pages_unlocked() and never honor NUMA faults.

Converting that to some other pattern might be possible (although KVM
plays quite a few tricks here!), but assuming we would always first do a
get_user_page_fast_only(), then when not intending to write
(!FOLL_WRITE):

(1) get_user_page_fast_only() would honor NUMA faults and fail
(2) get_user_pages() would not honor NUMA faults and succeed

Hmmm ... so we would have to use get_user_pages_fast()? It might be
possible, but I am not sure if we want get_user_pages_fast() to always
honor NUMA faults, because ...

>
> In fact, I feel like this is what the real rule should be: we simply
> define that get_user_pages_fast() is about looking up the page in the
> page tables.
>
> So if you want something that acts like a page table lookup, you use
> that "fast" thing. It's literally how it is designed. The whole - and
> pretty much only - point of it is that it can be used with no locking
> at all, because it basically acts like the hardware lookup does.
>

... I see what you mean (HW would similarly refuse to use such a page),
but I do wonder if that makes the API clearer and if this is what we
actually want.

We do have callers of pin_user_pages_fast() and friends that maybe
*really* shouldn't care about NUMA hinting.
iov_iter_extract_user_pages() is one example -- used for O_DIRECT
nowadays.

Their logic is "if it's directly in the page table, great, hand it
over. If not, please take the slow path." In many cases user space just
touched these pages, so they are very likely in the page table.

Converting them to pin_user_pages() would mean they will just run slower
in the common case.

Converting them to a manual pin_user_pages_fast_only() +
pin_user_pages() doesn't seem very compelling.

... so we would need a new API? :/

> So then if KVM wants to look up a page in the page table, that is what
> kvm should use, and it automatically gets the "honor numa faults"
> behavior, not because it sets a magic flag, but simply because that is
> how GUP-fast *works*.
>
> But if you use the "normal" get/pin_user_pages() function, which looks
> up the vma, at that point you are following things at a "software
> level", and it wouldn't do NUMA faulting, it would just get the page.

My main problem with that is that pin_user_pages_fast() and friends are
used all over the place for the "likely already in the page table, so
just make everything faster by default" case.

Always honoring NUMA faults here does not sound like the improvement we
wanted to have :) ... we actually *don't* want to honor NUMA faults
here.

We just have to find a way to make the KVM special case happy.

Thanks!

-- 
Cheers,

David / dhildenb
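
To make the two conditions from earlier in this mail concrete, here is a
small user-space model (not kernel code) of how GUP-fast might treat a
pte_protnone() entry once FOLL_HONOR_NUMA_FAULT is left to the caller.
The struct, the MODEL_* flag values and the function names are all made
up for illustration; only the decision logic follows the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define MODEL_FOLL_WRITE            (1u << 0)
#define MODEL_FOLL_HONOR_NUMA_FAULT (1u << 1)

/* Minimal stand-in for a page table entry (assumption, not a real pte_t). */
struct model_pte {
	bool present;   /* something is mapped here */
	bool protnone;  /* marked PROT_NONE, e.g. for NUMA hinting */
	bool writable;  /* mapped writable */
};

/*
 * Lockless fast path as sketched in the mail: return true if the entry
 * can be used as-is, false if the caller has to fall back to the slow
 * path (or fail, for a "fast only" lookup).
 */
static bool model_gup_fast_ok(struct model_pte pte, unsigned int flags)
{
	if (!pte.present)
		return false;

	if (pte.protnone) {
		/*
		 * Conditions (1) and (2): the page is only grabbed if the
		 * caller neither honors NUMA faults nor wants to write.
		 */
		if (flags & (MODEL_FOLL_HONOR_NUMA_FAULT | MODEL_FOLL_WRITE))
			return false;
		return true;
	}

	/* Ordinary entry: a write request needs a writable mapping. */
	if ((flags & MODEL_FOLL_WRITE) && !pte.writable)
		return false;

	return true;
}

/* Usage: the three cases for a NUMA-hinting (protnone) entry. */
static void model_selfcheck(void)
{
	const struct model_pte hinted = { .present = true, .protnone = true };

	/* Majority of callers: no flag, read-only -> page is grabbed. */
	assert(model_gup_fast_ok(hinted, 0));
	/* KVM-style caller that opts in -> fast path refuses. */
	assert(!model_gup_fast_ok(hinted, MODEL_FOLL_HONOR_NUMA_FAULT));
	/* Write access to a protnone entry -> fast path refuses. */
	assert(!model_gup_fast_ok(hinted, MODEL_FOLL_WRITE));
}
```

In this model only a caller that opts in, or one that asks for write
access, sees the fast path refuse a NUMA-hinting entry; everybody else
keeps the fast common case.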