From: David Hildenbrand
To: Jann Horn
Cc: kernel list, Linux-MM, Andrew Morton, Arnd Bergmann, Michal Hocko,
 Oscar Salvador, Matthew Wilcox, Andrea Arcangeli, Minchan Kim,
 Jason Gunthorpe, Dave Hansen, Hugh Dickins, Rik van Riel,
 Michael S. Tsirkin, Kirill A. Shutemov, Vlastimil Babka,
 Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
 James E.J. Bottomley, Helge Deller, Chris Zankel, Max Filippov,
 Mike Kravetz, Peter Xu, Rolf Eike Beer, linux-alpha@vger.kernel.org,
 linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
 linux-xtensa@linux-xtensa.org, linux-arch, Linux API
Subject: Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
Date: Wed, 7 Apr 2021 12:31:11 +0200
Message-ID: <5f49b60c-957d-8cb4-de7a-7c855dc72942@redhat.com>
In-Reply-To: <54165ffe-dbf7-377a-a710-d15be4701f20@redhat.com>
References: <20210317110644.25343-1-david@redhat.com>
 <20210317110644.25343-3-david@redhat.com>
 <2bab28c7-08c0-7ff0-c70e-9bf94da05ce1@redhat.com>
 <26227fc6-3e7b-4e69-f69d-4dc2a67ecfe8@redhat.com>
 <54165ffe-dbf7-377a-a710-d15be4701f20@redhat.com>
Organization: Red Hat GmbH

On 30.03.21 18:31, David Hildenbrand wrote:
> On 30.03.21 18:30, David Hildenbrand wrote:
>> On 30.03.21 18:21, Jann Horn wrote:
>>> On Tue, Mar 30, 2021 at 5:01 PM David Hildenbrand wrote:
>>>>>> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
>>>>>> +			    unsigned long end, bool write, int *locked)
>>>>>> +{
>>>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>>>> +	unsigned long nr_pages = (end - start) / PAGE_SIZE;
>>>>>> +	int gup_flags;
>>>>>> +
>>>>>> +	VM_BUG_ON(!PAGE_ALIGNED(start));
>>>>>> +	VM_BUG_ON(!PAGE_ALIGNED(end));
>>>>>> +	VM_BUG_ON_VMA(start < vma->vm_start, vma);
>>>>>> +	VM_BUG_ON_VMA(end > vma->vm_end, vma);
>>>>>> +	mmap_assert_locked(mm);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
>>>>>> +	 *		  a poisoned page.
>>>>>> +	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
>>>>>> +	 * !FOLL_FORCE: Require proper access permissions.
>>>>>> +	 */
>>>>>> +	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
>>>>>> +	if (write)
>>>>>> +		gup_flags |= FOLL_WRITE;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
>>>>>> +	 * or with insufficient permissions.
>>>>>> +	 */
>>>>>> +	return __get_user_pages(mm, start, nr_pages, gup_flags,
>>>>>> +				NULL, NULL, locked);
>>>>>
>>>>> You mentioned in the commit message that you don't want to actually
>>>>> dirty all the file pages and force writeback; but doesn't
>>>>> POPULATE_WRITE still do exactly that? In follow_page_pte(), if
>>>>> FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
>>>>
>>>> Well, I mention that POPULATE_READ explicitly doesn't do that. I
>>>> primarily set it because populate_vma_page_range() also sets it.
>>>>
>>>> Is it safe to *not* set it? IOW, fault something writable into a page
>>>> table (where the CPU could dirty it without additional page faults)
>>>> without marking it accessed? To me, this made logical sense. Thus I
>>>> also understood why populate_vma_page_range() sets it.
>>>
>>> FOLL_TOUCH doesn't have anything to do with installing the PTE - it
>>> essentially means "the caller of get_user_pages() wants to read/write
>>> the contents of the returned page, so please do the same things you
>>> would do if userspace was accessing the page".
>>> So in particular, if
>>> you look up a page via get_user_pages() with FOLL_WRITE|FOLL_TOUCH,
>>> that tells the MM subsystem "I will be writing into this page directly
>>> from the kernel, bypassing the userspace page tables, so please mark
>>> it as dirty now so that it will be properly written back later". Part
>>> of that is that it marks the page as recently used, which has an
>>> effect on LRU pageout behavior, I think - as far as I understand, that
>>> is why populate_vma_page_range() uses FOLL_TOUCH.
>>>
>>> If you look at __get_user_pages(), you can see that it is split up
>>> into two major parts: faultin_page() for creating PTEs, and
>>> follow_page_mask() for grabbing pages from PTEs. faultin_page()
>>> ignores FOLL_TOUCH completely; only follow_page_mask() uses it.
>>>
>>> In a way I guess maybe you do want the "mark as recently accessed"
>>> part that FOLL_TOUCH would give you without FOLL_WRITE? But I think
>>> you very much don't want the dirtying that FOLL_TOUCH|FOLL_WRITE leads
>>> to. Maybe the ideal approach would be to add a new FOLL flag to say "I
>>> only want to mark as recently used, I don't want to dirty". Or maybe
>>> it's enough to just leave out FOLL_TOUCH entirely, I don't know.
>>
>> Any thoughts on why populate_vma_page_range() does it?
>
> Sorry, I missed the explanation above - thanks!

Looking into the details, adjusting the FOLL_TOUCH logic won't make too
much of a difference for MADV_POPULATE_WRITE, I guess. AFAIKS, the
biggest impact of FOLL_TOUCH is actually with FOLL_FORCE - which we are
not using, but populate_vma_page_range() is.

If a page was not faulted in yet,
faultin_page(FOLL_WRITE)->handle_mm_fault(FAULT_FLAG_WRITE) will already
mark the PTE/PMD/... dirty and accessed; handle_pte_fault() is one
example. We will mark the page accessed again via FOLL_TOUCH, which
doesn't seem to be strictly required.

If the page was already faulted in, we have three cases:

1. Page faulted in writable.
   The page should already be dirty (otherwise
   we would be in trouble, I guess). We will mark it accessed.

2. Page faulted in readable. handle_mm_fault() will fault it in writable
   and set the page dirty.

3. Page faulted in readable and we have FOLL_FORCE. We mark the page
   dirty and accessed.

So doing a MADV_POPULATE_WRITE, whereby we prefault page tables
writable, doesn't seem to fly without marking the pages dirty. That's
one reason why I included MADV_POPULATE_READ.

We could:

a) Drop FOLL_TOUCH. We would not mark the page accessed, which would
   mean it gets evicted earlier rather than later.

b) Introduce FOLL_ACCESSED, which won't do the dirtying. But then, the
   pages are already dirty as explained above, so there isn't a real
   observable change.

c) Keep it as is: mark the page accessed and dirty. As the page is
   already dirty, that does not seem to be a real issue.

Am I missing something obvious? Thanks!

-- 
Thanks,

David / dhildenb