Message-ID: <9c167d01-ef09-ec4e-b4a1-2fff62bf01fe@redhat.com>
Date: Wed, 9 Nov 2022 11:34:43 +0100
To: Muhammad Usama Anjum, Michał Mirosław, Andrei Vagin,
 Danylo Mocherniuk, Alexander Viro, Andrew Morton, Suren Baghdasaryan,
 Greg KH, Christian Brauner, Peter Xu, Yang Shi, Vlastimil Babka,
 Zach O'Keefe, "Matthew Wilcox (Oracle)", "Gustavo A. R. Silva",
 Dan Williams, kernel@collabora.com, Gabriel Krisman Bertazi,
 Peter Enderborg, "open list : KERNEL SELFTEST FRAMEWORK", Shuah Khan,
 open list, "open list : PROC FILESYSTEM",
 "open list : MEMORY MANAGEMENT", Paul Gofman
References: <20221109102303.851281-1-usama.anjum@collabora.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v6 0/3] Implement IOCTL to get and/or the clear info about PTEs
In-Reply-To: <20221109102303.851281-1-usama.anjum@collabora.com>

On 09.11.22 11:23, Muhammad Usama Anjum wrote:
> Changes in v6:
> - Updated the interface and made cosmetic changes
>
> Original Cover Letter in v5:
> Hello,
>
> This patch series implements IOCTL on the pagemap procfs file to get the
> information about the page table entries (PTEs). The following operations
> are supported in this ioctl:
> - Get the information if the pages are soft-dirty, file mapped, present
>   or swapped.
> - Clear the soft-dirty PTE bit of the pages.
> - Get and clear the soft-dirty PTE bit of the pages atomically.
>
> Soft-dirty PTE bit of the memory pages can be read by using the pagemap
> procfs file. The soft-dirty PTE bit for the whole memory range of the
> process can be cleared by writing to the clear_refs file. There are other
> methods to mimic this information entirely in userspace with poor
> performance:
> - The mprotect syscall and SIGSEGV handler for bookkeeping
> - The userfaultfd syscall with the handler for bookkeeping
> Some benchmarks can be seen here[1]. This series adds features that weren't
> present earlier:
> - There is no atomic get soft-dirty PTE bit status and clear operation
>   possible.
> - The soft-dirty PTE bit of only a part of memory cannot be cleared.
>
> Historically, soft-dirty PTE bit tracking has been used in the CRIU
> project. The procfs interface is enough for finding the soft-dirty bit
> status and clearing the soft-dirty bit of all the pages of a process.
> We have the use case where we need to track the soft-dirty PTE bit for
> only specific pages on demand. We need this tracking and clear mechanism
> of a region of memory while the process is running to emulate the
> getWriteWatch() syscall of Windows. This syscall is used by games to
> keep track of dirty pages to process only the dirty pages.
>
> The information related to pages if the page is file mapped, present and
> swapped is required for the CRIU project[2][3]. The addition of the
> required mask, any mask, excluded mask and return masks are also required
> for the CRIU project[2].
>
> The IOCTL returns the addresses of the pages which match the specific masks.
> The page addresses are returned in struct page_region in a compact form.
> The max_pages is needed to support a use case where user only wants to get
> a specific number of pages. So there is no need to find all the pages of
> interest in the range when max_pages is specified. The IOCTL returns when
> the maximum number of the pages are found. The max_pages is optional. If
> max_pages is specified, it must be equal or greater than the vec_size.
> This restriction is needed to handle worse case when one page_region only
> contains info of one page and it cannot be compacted. This is needed to
> emulate the Windows getWriteWatch() syscall.
>
> Some non-dirty pages get marked as dirty because of the kernel's
> internal activity (such as VMA merging as soft-dirty bit difference isn't
> considered while deciding to merge VMAs). The dirty bit of the pages is
> stored in the VMA flags and in the per page flags. If any of these two bits
> are set, the page is considered to be soft dirty. Suppose you have cleared
> the soft dirty bit of half of VMA which will be done by splitting the VMA
> and clearing soft dirty bit flag in the half VMA and the pages in it. Now
> kernel may decide to merge the VMAs again. So the half VMA becomes dirty
> again. This splitting/merging costs performance. The application receives
> a lot of pages which aren't dirty in reality but marked as dirty.
> Performance is lost again here. Also sometimes user doesn't want the newly
> allocated memory to be marked as dirty. PAGEMAP_NO_REUSED_REGIONS flag
> solves both the problems. It is used to not depend on the soft dirty flag
> in the VMA flags. So VMA splitting and merging doesn't happen. It only
> depends on the soft dirty bit of the individual pages. Thus by using this
> flag, there may be a scenerio such that the new memory regions which are
> just created, doesn't look dirty when seen with the IOCTL, but look dirty
> when seen from procfs. This seems okay as the user of this flag know the
> implication of using it.

Please separate that part out from the other changes; I am still not
convinced that we want this and what the semantical implications are.

Let's take a look at an example: can_change_pte_writable()

    /* Do we need write faults for softdirty tracking? */
    if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
        return false;

We care about PTE softdirty tracking, if it is enabled for the VMA.
Tracking is enabled if: vma_soft_dirty_enabled()

    /*
     * Soft-dirty is kind of special: its tracking is enabled when
     * the vma flags not set.
     */
    return !(vma->vm_flags & VM_SOFTDIRTY);

Consequently, if VM_SOFTDIRTY is set, we are not considering the
soft_dirty PTE bits accordingly.

I'd suggest moving forward without this controversial
PAGEMAP_NO_REUSED_REGIONS functionality for now, and preparing it as a
clear add-on we can discuss separately.

--
Thanks,

David / dhildenb
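
For readers who want to step through the two checks quoted above, here is a
rough userspace model of them. It is only a sketch, not kernel code: the
struct names, the MODEL_VM_SOFTDIRTY value and every model_* function are
hypothetical stand-ins, and the real can_change_pte_writable() performs
further checks that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's types or values. */
#define MODEL_VM_SOFTDIRTY 0x1UL

struct model_vma {
    unsigned long vm_flags;
};

struct model_pte {
    bool soft_dirty;
};

/* Same expression as the quoted vma_soft_dirty_enabled(). */
bool model_vma_soft_dirty_enabled(const struct model_vma *vma)
{
    return !(vma->vm_flags & MODEL_VM_SOFTDIRTY);
}

/*
 * Models only the soft-dirty check quoted from can_change_pte_writable().
 * Returns false when write faults are still needed for soft-dirty
 * tracking, i.e. tracking is enabled and the PTE is not yet soft-dirty.
 */
bool model_softdirty_allows_writable(const struct model_vma *vma,
                                     const struct model_pte *pte)
{
    if (model_vma_soft_dirty_enabled(vma) && !pte->soft_dirty)
        return false;
    return true;
}

/* Walks the four combinations of the VMA flag and the PTE bit. */
void model_check_all_cases(void)
{
    const struct model_vma tracked = { .vm_flags = 0 };
    const struct model_vma vma_dirty = { .vm_flags = MODEL_VM_SOFTDIRTY };
    const struct model_pte clean = { .soft_dirty = false };
    const struct model_pte dirty = { .soft_dirty = true };

    /* Tracking enabled: the PTE bit decides the outcome. */
    assert(!model_softdirty_allows_writable(&tracked, &clean));
    assert(model_softdirty_allows_writable(&tracked, &dirty));

    /* VM_SOFTDIRTY set: the PTE bit no longer affects the outcome. */
    assert(model_softdirty_allows_writable(&vma_dirty, &clean));
    assert(model_softdirty_allows_writable(&vma_dirty, &dirty));
}
```

Calling model_check_all_cases() from a small test program exercises all four
combinations. In this model, once VM_SOFTDIRTY is set on the VMA the
per-PTE bit has no influence on the result, which is the case the reply
points at.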