From: Mauricio Faria de Oliveira <mauricio.oliveira@canonical.com>
Date: Thu, 3 Feb 2022 19:17:00 -0300
Subject: Re: [PATCH v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
To: Yu Zhao
Cc: Minchan Kim, "Huang, Ying", Andrew Morton, Yang Shi, Miaohe Lin,
 linux-mm@kvack.org, linux-block@vger.kernel.org
References: <20220131230255.789059-1-mfo@canonical.com>
Content-Type: text/plain; charset="UTF-8"

On Wed, Feb 2, 2022 at 6:53 PM Yu Zhao wrote:
>
> On Wed, Feb 02, 2022 at 06:27:47PM -0300, Mauricio Faria de Oliveira wrote:
> > On Wed, Feb 2, 2022 at 4:56 PM Yu Zhao wrote:
> > >
> > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > > > Problem:
> > > > =======
> > >
> > > Thanks for the update. A couple of quick questions:
> > >
> > > > Userspace might read the zero-page instead of actual data from a
> > > > direct IO read on a block device if the buffers have been called
> > > > madvise(MADV_FREE) on earlier (this is discussed below) due to a
> > > > race between page reclaim on MADV_FREE and blkdev direct IO read.
> > >
> > > 1) would page migration be affected as well?
> >
> > Could you please elaborate on the potential problem you considered?
> >
> > I checked migrate_pages() -> try_to_migrate() holds the page lock,
> > thus shouldn't race with shrink_page_list() -> try_to_unmap()
> > (where the issue with MADV_FREE is), but maybe I didn't get you
> > correctly.
>
> Could the race exist between DIO and migration? While DIO is writing
> to a page, could migration unmap it and copy the data from this page
> to a new page?

Thanks for clarifying.
I started looking into this, but since it's unrelated to MADV_FREE
(which doesn't apply to page migration), I guess this shouldn't block
this patch, if at all possible. Is that OK with you?

> > > > @@ -1599,7 +1599,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >
> > > >  			/* MADV_FREE page check */
> > > >  			if (!PageSwapBacked(page)) {
> > > > -				if (!PageDirty(page)) {
> > > > +				int ref_count, map_count;
> > > > +
> > > > +				/*
> > > > +				 * Synchronize with gup_pte_range():
> > > > +				 * - clear PTE; barrier; read refcount
> > > > +				 * - inc refcount; barrier; read PTE
> > > > +				 */
> > > > +				smp_mb();
> > > > +
> > > > +				ref_count = page_count(page);
> > > > +				map_count = page_mapcount(page);
> > > > +
> > > > +				/*
> > > > +				 * Order reads for page refcount and dirty flag;
> > > > +				 * see __remove_mapping().
> > > > +				 */
> > > > +				smp_rmb();
> > >
> > > 2) why does it need to order against __remove_mapping()? It seems to
> > > me that here (called from the reclaim path) it can't race with
> > > __remove_mapping() because both lock the page.
> >
> > I'll improve that comment in v4. The ordering isn't against
> > __remove_mapping(), but actually because of an issue described in
> > __remove_mapping()'s comments (something else that doesn't hold the
> > page lock, just has a page reference, that may clear the page dirty
> > flag then drop the reference; thus check ref, then dirty).
>
> Got it. IIRC, get_user_pages() doesn't imply a write barrier. If so,
> there should be a smp_wmb() on the other side:

If I understand it correctly, it actually implies a full memory barrier,
doesn't it? Because...

gup_pte_range() (fast path) calls try_grab_compound_head(), which
eventually calls (*) atomic_add_unless(), an atomic conditional RMW
operation with return value, thus fully ordered on success
(Documentation/atomic_t.txt); (on failure, gup_pte_range() falls back
to the slow path, below.)
And follow_page_pte() (slow path) calls try_grab_page(), which also
calls into try_grab_compound_head(), as above.

(* on CONFIG_TINY_RCU, it calls just atomic_add(), which isn't ordered,
but that option is targeted at UP/!SMP, thus not a problem for this race.)

Looking at the implementation of arch_atomic_fetch_add_unless() on more
relaxed/weakly ordered archs (arm, powerpc, if I got that right), there
are barriers like 'smp_mb()' and the 'sync' instruction if
'old != unless', so that seems to be OK.

And the set_page_dirty() calls occur after get_user_pages() / that point.

Does that make sense?

Thanks!

>
>   * get_user_pages(&page);
>
>   smp_wmb()
>
>   * SetPageDirty(page);
>   * put_page(page);
>
> (__remove_mapping() doesn't need smp_[rw]mb() on either side because
> it relies on page refcnt freeze and retesting.)
>
> Thanks.

--
Mauricio Faria de Oliveira