From: Mauricio Faria de Oliveira <mauricio.oliveira@canonical.com>
Date: Wed, 12 Jan 2022 18:53:07 -0300
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
To: Minchan Kim
Cc: "Huang, Ying", Yu Zhao, Andrew Morton, linux-mm@kvack.org,
 linux-block@vger.kernel.org, Miaohe Lin, Yang Shi
References: <20220105233440.63361-1-mfo@canonical.com> <87v8ypybdc.fsf@yhuang6-desk2.ccr.corp.intel.com>

Hi Minchan Kim,

Thanks for handling the hard questions! :)

On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim wrote:
>
> On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > Yu Zhao writes:
> >
> > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > >> index 163ac4e6bcee..8671de473c25 100644
> > >> --- a/mm/rmap.c
> > >> +++ b/mm/rmap.c
> > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >>
> > >>  			/* MADV_FREE page check */
> > >>  			if (!PageSwapBacked(page)) {
> > >> -				if (!PageDirty(page)) {
> > >> +				int ref_count = page_ref_count(page);
> > >> +				int map_count = page_mapcount(page);
> > >> +
> > >> +				/*
> > >> +				 * The only page refs must be from the isolation
> > >> +				 * (checked by the caller shrink_page_list() too)
> > >> +				 * and one or more rmap's (dropped by discard:).
> > >> +				 *
> > >> +				 * Check the reference count before dirty flag
> > >> +				 * with memory barrier; see __remove_mapping().
> > >> +				 */
> > >> +				smp_rmb();
> > >> +				if ((ref_count - 1 == map_count) &&
> > >> +				    !PageDirty(page)) {
> > >>  					/* Invalidate as we cleared the pte */
> > >>  					mmu_notifier_invalidate_range(mm,
> > >>  						address, address + PAGE_SIZE);
> > >
> > > Out of curiosity, how does it work with COW in terms of reordering?
> > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > is seen first, and direct io is holding a refcnt, this check can still
> > > pass?
> >
> > I think that you are correct.
> >
> > After more thoughts, it appears very tricky to compare page count and
> > map count. Even if we have added smp_rmb() between page_ref_count() and
> > page_mapcount(), an interrupt may happen between them. During the
> > interrupt, the page count and map count may be changed, for example,
> > unmapped, or do_swap_page().
>
> Yeah, it happens but what specific problem are you concerning from the
> count change under race? The fork case Yu pointed out was already known
> for breaking DIO so user should take care not to fork under DIO (Please
> look at O_DIRECT section in man 2 open). If you could give a specific
> example, it would be great to think over the issue.
>
> I agree it's little tricky but it seems to be way other place has used
> for a long time (Please look at write_protect_page in ksm.c).

Ah, that's great to see it's being used elsewhere, for DIO particularly!
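To line the two checks up, as I understand them (a rough side-by-side
sketch, not the exact code; 'swapped' is my reading of write_protect_page(),
accounting for the swap cache reference):

    /* v2 patch, try_to_unmap_one(): the LRU isolation ref is the "- 1" */
    if ((ref_count - 1 == map_count) && !PageDirty(page)) {
            /* no extra pin and still clean: safe to discard MADV_FREE page */
    }

    /* ksm.c, write_protect_page(), roughly: the caller's page ref is the
     * "+ 1"; any surplus implies a concurrent pin, e.g. O_DIRECT */
    if (page_mapcount(page) + 1 + swapped != page_count(page))
            goto out_unlock;        /* back off */

So both treat "more references than the known mappings plus our own" as a
sign of a possible in-flight pin.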
> So, here what we missing is tlb flush before the checking.

That shouldn't be required for this particular issue/case, IIUIC.

One of the things we checked early on was disabling deferred TLB flush
(similarly to what you've done), and it didn't help with the issue;
also, the issue happens in uniprocessor mode too (thus no remote CPU
involved).

>
> Something like this.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b0fd9dc19eba..b4ad9faa17b2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
>  			/* MADV_FREE page check */
>  			if (!PageSwapBacked(page)) {
> -				int refcount = page_ref_count(page);
> -
> -				/*
> -				 * The only page refs must be from the isolation
> -				 * (checked by the caller shrink_page_list() too)
> -				 * and the (single) rmap (dropped by discard:).
> -				 *
> -				 * Check the reference count before dirty flag
> -				 * with memory barrier; see __remove_mapping().
> -				 */
> -				smp_rmb();
> -				if (refcount == 2 && !PageDirty(page)) {
> +				if (!PageDirty(page) &&
> +					page_mapcount(page) + 1 == page_count(page)) {

In the interest of avoiding a different race/bug, it seemed worth
following the suggestion outlined in __remove_mapping(), i.e., checking
PageDirty() after the page's reference count, with a memory barrier in
between.

I'm not familiar with the details of the original issue behind that code
change, but it seemed possible here too, particularly as writes from
user space can happen asynchronously / after try_to_unmap_one() checked
the PTE was clean and didn't set PageDirty; and if the page's PTE is
present, there's no fault?

Thanks again,
Mauricio

>  					/* Invalidate as we cleared the pte */
>  					mmu_notifier_invalidate_range(mm,
>  						address, address + PAGE_SIZE);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f3162a5724de..6454ff5c576f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1754,6 +1754,9 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>  			enum ttu_flags flags = TTU_BATCH_FLUSH;
>  			bool was_swapbacked = PageSwapBacked(page);
>
> +			if (!was_swapbacked && PageAnon(page))
> +				flags &= ~TTU_BATCH_FLUSH;
> +
>  			if (unlikely(PageTransHuge(page)))
>  				flags |= TTU_SPLIT_HUGE_PMD;
>
>
>

-- 
Mauricio Faria de Oliveira