References: <20220131230255.789059-1-mfo@canonical.com>
From: Mauricio Faria de Oliveira
Date: Wed, 2 Feb 2022 13:29:25 -0300
Subject: Re: [PATCH v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
To: Christoph Hellwig
Cc: Minchan Kim, "Huang, Ying", Yu Zhao, Andrew Morton, Yang Shi, Miaohe Lin,
    linux-mm@kvack.org, linux-block@vger.kernel.org, axboe@kernel.dk, John Hubbard

On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig wrote:
>
> On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > Well, blkdev_direct_IO() gets references for all pages, and on READ
> > operations it only sets them dirty _later_.
> >
> > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for
> > direct IO read from block devices, and page reclaim happens during
> > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
> > returns, but BEFORE the pages are set dirty, the situation happens.
> >
> > The direct IO read eventually completes. Now, when userspace reads
> > the buffers, the PTE is no longer there and the page fault handler
> > do_anonymous_page() services that with the zero-page, NOT the data!
>
> So why not just set the pages dirty early like the other direct I/O
> implementations? Or if this is fine with the patch should we remove
> the early dirtying elsewhere?

In general, since this particular problem is specific to MADV_FREE,
it seemed right to go for a more contained solution, rather than
changes with broader impact (and risk) to code that isn't broken.
This isn't to say those approaches shouldn't be pursued, just that the
larger changes aren't strictly needed to fix _this_ issue (and might
complicate landing the fix in the stable/distro kernels).

As for the two suggestions specifically, I'm not very familiar with
the other implementations, so I can't really comment on them, sorry.
However, on the first one (setting the pages dirty early), John noted
[1] that there might be issues with that and advised against it.

>
> > Reproducer:
> > ==========
> >
> > @ test.c (simplified, but works)
>
> Can you add this to blktests or some other regularly run regression
> test suite?

Sure. The test also needs the kernel-side change (to trigger memory
reclaim), which could probably be wired up for blktests through a
fault-injection capability. Does that sound good? Maybe there's a
better way to do it.

>
> > + smp_rmb();
> > +
> > + /*
> > + * The only page refs must be from the isolation
> > + * plus one or more rmap's (dropped by discard:).
>
> Overly long line.

Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100
characters. OK; will fix in v4.

>
> > + */
> > + if ((ref_count == 1 + map_count) &&
>
> No need for the inner braces.
>

OK; will fix in v4.

I'll wait a bit in case more changes are needed, then send v4 with the
above.

Thanks!

[1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com/

-- 
Mauricio Faria de Oliveira