From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C617EC6FD1D for ; Wed, 15 Mar 2023 16:54:58 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 31B9F6B0080; Wed, 15 Mar 2023 12:54:58 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2CB826B0083; Wed, 15 Mar 2023 12:54:58 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 16C7E6B0093; Wed, 15 Mar 2023 12:54:58 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 048396B0080 for ; Wed, 15 Mar 2023 12:54:58 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id BDFEB160246 for ; Wed, 15 Mar 2023 16:54:57 +0000 (UTC) X-FDA: 80571732234.11.D003F7B Received: from madras.collabora.co.uk (madras.collabora.co.uk [46.235.227.172]) by imf19.hostedemail.com (Postfix) with ESMTP id CC6051A001C for ; Wed, 15 Mar 2023 16:54:54 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=SHQ3SHGg; dmarc=pass (policy=quarantine) header.from=collabora.com; spf=pass (imf19.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1678899295; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=r06RjJqr4Wdt0ir7HlVfMOl63693N31DlSYXdEhpu6I=; b=Dd9AMKof7Ok/DFgX2yP7drWNJGwW0TqfOo/R+LpMBfigEfLOgzwntZ1GJy+H8ggudttf1u mb3WPEmfc8RDzrjVArKv7LTvIm4K3YrJARZUTjZn2ETqk6J+m8uXugWz/AxNobZgl9BbxG nTEN5R+vghuokAUCNGn2tfX8dnZeZ1I= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=collabora.com header.s=mail header.b=SHQ3SHGg; dmarc=pass (policy=quarantine) header.from=collabora.com; spf=pass (imf19.hostedemail.com: domain of usama.anjum@collabora.com designates 46.235.227.172 as permitted sender) smtp.mailfrom=usama.anjum@collabora.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1678899295; a=rsa-sha256; cv=none; b=XfmGilLJSXir0TigBodZBDx5EZDNpu0huyUL+Mr+pScJeRIVTyu6gM2IIq5azt9wzlMW/q Ib17TbDkqaG9OTwDFheDsTrvEtgpUlca6eHPHNucAA6vFOW1xw/3AJwII4SQYWPQWvAIwb ObwbxGbsuHH27wD/O1oUSeJw2VMBm/I= Received: from [192.168.10.39] (unknown [39.37.168.222]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) (Authenticated sender: usama.anjum) by madras.collabora.co.uk (Postfix) with ESMTPSA id 3BBA56602065; Wed, 15 Mar 2023 16:54:46 +0000 (GMT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=collabora.com; s=mail; t=1678899292; bh=EyXR2YxwFVjBAcx1rUVLdzVeq5GifyBb6Z8dID0brCk=; h=Date:Cc:Subject:To:References:From:In-Reply-To:From; b=SHQ3SHGgunZCD9CQKxkcMrkNu82y/hn/LXDp+BkqMXc7hJnD5qh+gIhMIMTiEOxR7 EkaldNLGjTXTymZH3QmhQwWDCVW4Jeb9n47yKYEdRiLYkghf+cXEAC/nsxUXsFOU2Y m/EqUEI/mHPFRT+cSK436ppqmBN5zq99BRVK8+99lGAz0dYSBatwE8U9psqRXg6OLB XOXxGInAuGnjbKNFohYXlKgAy4J29WtLirHCECYN9Iz1eno/uOz+mTJgqF8rDdp3F6 UevrNe3irtE7kpeBmhfQ8nWRofKmCdG/rCgthAST7DOfk8vKWFLJRi7wtbLiQFdarr GzoyGjH6cfBCw== Message-ID: <3d2d1ba4-bfab-6b3d-f0d6-ae0920ebdcb0@collabora.com> Date: Wed, 15 Mar 2023 21:54:40 +0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.2 Cc: Muhammad Usama Anjum , David Hildenbrand , Andrew Morton , =?UTF-8?B?TWljaGHFgiBNaXJvc8WC?= =?UTF-8?Q?aw?= , Andrei Vagin , Danylo Mocherniuk , Paul Gofman , Cyrill Gorcunov , Mike Rapoport , Nadav Amit , Alexander Viro , Shuah Khan , Christian Brauner , Yang Shi , Vlastimil Babka , "Liam R . Howlett" , Yun Zhou , Suren Baghdasaryan , Alex Sierra , Matthew Wilcox , Pasha Tatashin , Axel Rasmussen , "Gustavo A . R . Silva" , Dan Williams , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH , kernel@collabora.com Subject: Re: [PATCH v11 4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs Content-Language: en-US To: Peter Xu References: <20230309135718.1490461-1-usama.anjum@collabora.com> <20230309135718.1490461-5-usama.anjum@collabora.com> From: Muhammad Usama Anjum In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: CC6051A001C X-Rspamd-Server: rspam09 X-Rspam-User: X-Stat-Signature: 3i7y7ixdx4hy8pzjdwmdkniijn4uyszd X-HE-Tag: 1678899294-566732 X-HE-Meta: U2FsdGVkX1/8x2QVwCmo7CaSBDeW2BhGMR/eiRV4D2xTdDNja+ccl4aR7yqMuqcn5vEea5vyWoUvTQQ4BkLKPy5beXZV2kPCtdJKwWHDx8+0jt2uU6QC12TzMVzQMbyVaoeFAujysKWfnjSU26GkXDz9AqPcGR4Gn3guRDndB4ZuULKe6ZUYkMGaKVuKmsenLsxNoEVPFrOIYMv3JE21sGbKFb8uHqJ1nE2B+M/djhvVnaeDTknqS1xO/Xpqsa0ssNQs+NIk1XdyCsXwA95EyQRp2ZFC39+IERwjfXJuan1LtqX8CBq9JN3jUHnv36kiGgEFHGGL/ZHMi5rOiEN9/hgkz1FU1j8Y1QJuR8yCsE925qH6aW9AoyUuIHRlUUUaXKWFumQUCyHEKYdNVOTWq4R6lwgIchyk+O+3rp+DR47IOMgKDOm+JSj5WrJFO2H4y7XpOrR+6ZA3DhGyrfZK9zBRfj2tdwtvQ5HmbLTGHgQktz1097fx0uJ2zAzWsKUo0GBDe3NqidA46qBBxuwVjcGYd5v8PCiM+lpbMUYSPauBUgvZfWP0cOGhtzVKY2l50Cyb3tWEwn1SoKjvyf5O4j9a6a/kUW41Sb2ovJFUc6eVH746Kys7cUW5HyToXF7y4IQEkuf3ccp5uqn9myO8mdPUbuL3zW5mUI/fCpEXSLMfjNgX/TY7zHA3kzHpJCzfFlIZvZ9f7qq/z14UhlNn2CMi5Uaavk1UEcLiBjXjLNpP0uveRJ/RqR8iRi5WYphtBTsAH0ZCuRFJTprsTFIuoPYdk0ZR8eFCnzoXbJTxUVUhCXNWZcrxjOJNtgw+pODitbsQ9HsUpv8T3PULoVOc1N4/ASQ0Wryu0EzGii5T+BICN5Wd98/WYssXx7rJ7ELQULzhnhb02vQNCvcW0pYhSw70Mr3rKdX8gFZgyqnJzeH5LAGi6d6jUaC2zOPsfIz2ecJz6pkxNNJ7kCtlHYf lLBzElhQ J1pYTeJITnvYELQM8VQ96sp9mFhol3aHlbeszfg9Zq+q4Eks1iCEWpp5ISZPMw5G8U5Fgv5DNGzkx1l509Q3XS7IrlNuDzilyECVbKuywjG9eKEywfjrI2i2tz7IWj9zBZ8D5lrCtYqUIiFq81wsON7sDVWR+2fDZP1BpSrsIuL8k6veCeN3090q0QUGqBefOXymapX56hlD+eQ2CPkhupNaU30lKPB12eW2P0Z63TT23RSYTAVZgdXGBLzbqQhuK9/9iiYdb4zlzL0s= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 3/15/23 8:55 PM, Peter Xu wrote: > On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote: >> + for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) { >> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); >> + >> + is_writ = !is_pte_uffd_wp(*pte); >> + is_file = vma->vm_file; >> + is_pres = pte_present(*pte); >> + is_swap = is_swap_pte(*pte); >> + >> + pte_unmap_unlock(pte, ptl); >> + >> + ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap, >> + p, addr, 1); >> + if (ret) >> + break; >> + >> + if (PM_SCAN_OP_IS_WP(p) && is_writ && >> + uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0) >> + ret = -EINVAL; >> + } > > This is not real atomic.. > > Taking the spinlock for eacy pte is not only overkill but wrong in > atomicity because the pte can change right after spinlock unlocked. Let me explain. It seems like wrong, but it isn't. In my rigorous testing, it didn't show any side-effect. Here we are finding out if a page is written. If page is written, only then we clear it. Lets look at the different possibilities here: - If a page isn't written, we'll not clear it. - If a page is written and there isn't any race, we'll clear written-to flag by write protecting it. - If a page is written but before clearing it, data is written again to the page. The page would remain written and we'll clear it. - If a page is written but before clearing it, it gets write protected, we'll still write protected it. There is double right protection here, but no side-effect. Lets turn this into a truth table for easier understanding. Here first coulmn and thrid column represents this above code. 2nd column represents any other thread interacting with the page. If page is written/dirty some other task interacts wp_page no does nothing no no writes to page no no wp the page no yes does nothing yes yes write to page yes yes wp the page yes As you can see there isn't any side-effect happening. We aren't over doing the wp or under-doing the write-protect. Even if we were doing something wrong here and I bring the lock over all of this, the pages get become written or wp just after unlocking. It is expected. This current implementation doesn't seem to be breaking this. Is my understanding wrong somewhere here? Can you point out? Previous to this current locking design were either buggy or slower when multiple threads were working on same pages. Current implementation removes the limitations: - The memcpy inside pagemap_scan_output is happening with pte unlocked. - We are only wp a page if we have noted this page to be dirty - No mm write lock is required. Only read lock works fine just like userfaultfd_writeprotect() takes only read lock. There is only one con here that we are locking and unlocking the pte lock again and again. Please have a look at my explanation and let me know what do you think. > > Unfortunately you also cannot reuse uffd_wp_range() because that's not > atomic either, my fault here. Probably I was thinking mostly from > soft-dirty pov on batching the collect+reset. > > You need to take the spin lock, collect whatever bits, set/clear whatever > bits, only until then release the spin lock. > > "Not atomic" means you can have some page got dirtied but you could miss > it. Depending on how strict you want, I think it'll break apps like CRIU > if strict atomicity needed for migrating a process. If we want to have a > new interface anyway, IMHO we'd better do that in the strict way. In my rigorous multi-threaded testing where a lots of threads are working on same set of pages, we aren't losing even a single update. I can share the test if you want. > > Same comment applies to the THP handling (where I cut from the context). > -- BR, Muhammad Usama Anjum