Date: Tue, 28 Feb 2023 10:58:06 -0500
From: Peter Xu <peterx@redhat.com>
To: Muhammad Usama Anjum
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrea Arcangeli,
    Andrew Morton, Mike Rapoport, Axel Rasmussen, Nadav Amit,
    David Hildenbrand, "kernel@collabora.com"
Subject: Re: [PATCH v2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
References: <20230227230044.1596744-1-peterx@redhat.com>

On Tue, Feb 28, 2023 at 12:21:45PM +0500, Muhammad Usama Anjum wrote:
> Hi Peter,

Hi, Muhammad,

>
> Thank you so much for sending.
>
> On 2/28/23 5:36 AM, Peter Xu wrote:
> > On Mon, Feb 27, 2023 at 06:00:44PM -0500, Peter Xu wrote:
> >> This is a new feature that controls how uffd-wp handles none ptes. When
> >> it's set, the kernel will handle anonymous memory the same way as file
> >> memory, by allowing the user to wr-protect unpopulated ptes.
> >>
> >> File memories handles none ptes consistently by allowing wr-protecting of
> >> none ptes because of the unawareness of page cache being exist or not. For
> >> anonymous it was not as persistent because we used to assume that we don't
> >> need protections on none ptes or known zero pages.
> >>
> >> One use case of such a feature bit was VM live snapshot, where if without
> >> wr-protecting empty ptes the snapshot can contain random rubbish in the
> >> holes of the anonymous memory, which can cause misbehave of the guest when
> >> the guest OS assumes the pages should be all zeros.
> >>
> >> QEMU worked it around by pre-populate the section with reads to fill in
> >> zero page entries before starting the whole snapshot process [1].
> >>
> >> Recently there's another need raised on using userfaultfd wr-protect for
> >> detecting dirty pages (to replace soft-dirty in some cases) [2]. In that
> >> case if without being able to wr-protect none ptes by default, the dirty
> >> info can get lost, since we cannot treat every none pte to be dirty (the
> >> current design is identify a page dirty based on uffd-wp bit being cleared).
> >>
> >> In general, we want to be able to wr-protect empty ptes too even for
> >> anonymous.
> >>
> >> This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
> >> uffd-wp handling on none ptes being consistent no matter what the memory
> >> type is underneath. It doesn't have any impact on file memories so far
> >> because we already have pte markers taking care of that. So it only
> >> affects anonymous.
> >>
> >> The feature bit is by default off, so the old behavior will be maintained.
> >> Sometimes it may be wanted because the wr-protect of none ptes will contain
> >> overheads not only during UFFDIO_WRITEPROTECT (by applying pte markers to
> >> anonymous), but also on creating the pgtables to store the pte markers. So
> >> there's potentially less chance of using thp on the first fault for a none
> >> pmd or larger than a pmd.
> >>
> >> The major implementation part is teaching the whole kernel to understand
> >> pte markers even for anonymously mapped ranges, meanwhile allowing the
> >> UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
> >> new feature bit is set.
> >>
> >> Note that even if the patch subject starts with mm/uffd, there're a few
> >> small refactors to major mm path of handling anonymous page faults. But
> >> they should be straightforward.
> >>
> >> So far, add a very light smoke test within the userfaultfd kselftest
> >> pagemap unit test to make sure anon pte markers work.
> >>
> >> [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> >> [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> >>
> >> Signed-off-by: Peter Xu
> >> ---
> >> v1->v2:
> >> - Use pte markers rather than populate zero pages when protect [David]
> >> - Rename WP_ZEROPAGE to WP_UNPOPULATED [David]
> >
> > Some very initial performance numbers (I only ran in a VM but it should be
> > similar, unit is "us") below as requested. The measurement is about time
> > spent when wr-protecting 10G range of empty but mapped memory. It's done
> > in a VM, assuming we'll get similar results on bare metal.
> >
> > Four test cases:
> >
> > - default UFFDIO_WP
> > - pre-read the memory, then UFFDIO_WP (what QEMU does right now)
> > - pre-fault using MADV_POPULATE_READ, then default UFFDIO_WP
> > - UFFDIO_WP with WP_UNPOPULATED
> >
> > Results:
> >
> > Test DEFAULT: 2
> > Test PRE-READ: 3277099 (pre-fault 3253826)
> > Test MADVISE: 2250361 (pre-fault 2226310)
> > Test WP-UNPOPULATE: 20850
> >
> > I'll add these information into the commit message when there's a new
> > version.
> I'm hitting a bug where I'm unable to write to the memory after adding this
> patch and wp the memory. I'm hitting this case in your test and my tests as
> well. Please apply the following diff to your test to reproduce on your end:
>
> --- uffd_wp_perf.c.orig 2023-02-28 12:09:38.971820791 +0500
> +++ uffd_wp_perf.c      2023-02-28 12:13:11.077827160 +0500
> @@ -114,6 +114,7 @@
>                 start1 = get_usec();
>         }
>         wp_range(uffd, buffer, SIZE, true);
> +       buffer[0] = 'a';
>         if (start1 == 0)
>                 printf("%"PRIu64"\n", get_usec() - start);
>         else

This is expected, because the test didn't start any fault resolving thread,
so the write will block until someone unprotects the page.
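
For illustration only, a fault resolving thread could look roughly like the
sketch below. It assumes the uffd was opened without O_NONBLOCK and the range
is already registered in wr-protect mode. The names make_unprotect_req() and
resolve_wp_faults() are made up for this example and are not part of
uffd_wp_perf.c.

```c
#include <assert.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: build a UFFDIO_WRITEPROTECT request that drops
 * wr-protection on the page containing addr.  mode == 0 means
 * "unprotect, and wake up the faulting thread".
 */
struct uffdio_writeprotect make_unprotect_req(uint64_t addr, uint64_t page_size)
{
    struct uffdio_writeprotect wp;

    /* The mask below only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    memset(&wp, 0, sizeof(wp));
    wp.range.start = addr & ~(page_size - 1);
    wp.range.len = page_size;
    wp.mode = 0;
    return wp;
}

/*
 * Hypothetical thread body: read fault messages from the uffd and resolve
 * every wr-protect fault by unprotecting the faulting page.  arg points to
 * the uffd file descriptor (assumed to be a blocking fd).
 */
void *resolve_wp_faults(void *arg)
{
    int uffd = *(int *)arg;
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    struct uffd_msg msg;

    for (;;) {
        ssize_t n = read(uffd, &msg, sizeof(msg));

        if (n != (ssize_t)sizeof(msg))
            break;              /* error or uffd closed: stop */
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;           /* not a page fault event */
        if (!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
            continue;           /* not a wr-protect fault */

        struct uffdio_writeprotect wp =
            make_unprotect_req(msg.arg.pagefault.address, page_size);

        /* Unprotecting the page lets the blocked writer continue. */
        if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == -1)
            break;
    }
    return NULL;
}
```

Something like pthread_create(&tid, NULL, resolve_wp_faults, &uffd) before
the write to buffer[0] would then let that write complete instead of hanging.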

But it shouldn't happen in your use case if you applied both WP_UNPOPULATED
& WP_ASYNC. Could you try "cat /proc/$PID/stack" to see where your thread
is stuck?

-- 
Peter Xu