From: Minchan Kim
To: John Hubbard
Cc: Yu Zhao, Mauricio Faria de Oliveira, Andrew Morton, linux-mm@kvack.org, linux-block@vger.kernel.org, Huang Ying, Miaohe Lin, Yang Shi
Subject: Re: [PATCH v2] mm: fix race between MADV_FREE reclaim and blkdev direct IO read
Date: Tue, 11 Jan 2022 16:01:06 -0800
References: <20220105233440.63361-1-mfo@canonical.com> <7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com>
In-Reply-To: <7094dbd6-de0c-9909-e657-e358e14dc6c3@nvidia.com>

On Tue, Jan 11, 2022 at 03:38:24PM -0800, John Hubbard wrote:
> On 1/11/22 13:59, Minchan Kim wrote:
> ...
> > > > > Marking pages dirty after pinning them is a pre-existing area of
> > > > > problems. See the long-running LWN articles about get_user_pages() [1].
> > > >
> > > > Oh, Do you mean marking page dirty in DIO path is already problems?
> > >
> > > ^ marking page dirty too late in DIO path
> > >
> > > Typo fix.
> >
> > I looked through the articles but couldn't find dots to connect
> > issues with this MADV_FREE issue. However, man page shows a clue
>
> The area covered in those articles is about the fact that file system
> and block are not safely interacting with pinned memory. Even today.
> So I'm trying to make sure you're aware of that before you go too far
> in that direction.
>
> > why it's fine.
> >
> > ```
> > O_DIRECT I/Os should never be run concurrently with the fork(2)
> > system call, if the memory buffer is a private mapping (i.e., any
> > mapping created with the mmap(2) MAP_PRIVATE flag; this includes
> > memory allocated on the heap and statically allocated buffers).
> > Any such I/Os, whether submitted via an asynchronous I/O interface
> > or from another thread in the process, should be completed before
> > fork(2) is called. Failure to do so can result in data corruption
> > and undefined behavior in parent and child processes.
> >
> > ```
> >
> > I think it would make the copy_present_pte's page_dup_rmap safe.
>
> I'd have to see this in patch form, because I'm not quite able to visualize it yet.

It would be great if you could read through the original patch
description. Since v2 changed a little to account for multiple mappings
across parent and child, there is a slight mismatch with the
description, but it still explains the current problem well.

https://lore.kernel.org/all/20220105233440.63361-1-mfo@canonical.com/T/#u

The problem is that MADV_FREEed anonymous memory is supposed to work
based on dirtiness, which comes from either the dirty bit in the user
process's page table or PageDirty. When the VM can't see any dirtiness,
it just discards the anonymous memory instead of swapping it out. Thus,
the dirtiness is key for this to work correctly. However, DIO doesn't
mark the page dirty until the IO is completed, and at the same time the
store doesn't go through the user process's page table, whether it is
done by DMA or some other way. That lets the VM decide to just drop the
page, since it doesn't see any dirtiness on the page. As a result, the
end user would be surprised: the read syscall with DIO completed, but
the data is zero rather than the latest up-to-date data.

To prevent the problem, the patch suggested comparing page_mapcount
with page_count, since any additional reference on the page means
someone is accessing the memory, and in that case the page is not
discarded. However, Yu pointed out that page_count and page_mapcount
could be reordered, in copy_page_range for example. So I am looking for
a solution (one would be adding a memory barrier right before
page_dup_rmap, but I'd like to avoid it if we have another idea). And
then the man page says forking while DIO is in flight is already
prohibited, so the concern raised would be void, IIUC.

Hope this helps your understanding.

Thanks!
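
For readers who want something concrete, the check described above can be
modeled in plain userspace C. This is only a sketch under stated
assumptions, not the code from the patch: `struct model_page`,
`model_can_discard()`, `model_pin_for_dio()` and `model_demo()` are
hypothetical names, the counters stand in for page_count and
page_mapcount, and the reference that reclaim itself would hold is
ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a page; not the kernel's struct page. */
struct model_page {
    int refcount;   /* stands in for page_count: all references */
    int mapcount;   /* stands in for page_mapcount: page table mappings */
    bool dirty;     /* dirtiness visible to reclaim (pte dirty bit or PageDirty) */
};

/*
 * Decide whether reclaim may discard a lazily freed (MADV_FREE) page.
 * A clean page with references beyond its mappings may have IO in
 * flight (e.g. a DIO read that pinned it), so it is kept.
 */
static bool model_can_discard(const struct model_page *page)
{
    if (page->dirty)
        return false;
    return page->refcount == page->mapcount;
}

/* Model a DIO read pinning the page: an extra reference, no dirty bit yet. */
static void model_pin_for_dio(struct model_page *page)
{
    page->refcount++;
}

/* Walk through the scenario from the mail; aborts if the model misbehaves. */
static void model_demo(void)
{
    struct model_page page = { .refcount = 1, .mapcount = 1, .dirty = false };

    assert(model_can_discard(&page));   /* clean, no extra reference */
    model_pin_for_dio(&page);
    assert(!model_can_discard(&page));  /* extra reference: keep the page */
}
```

The model is single-threaded, so it says nothing about the reordering
concern Yu raised; it only shows why an extra reference on a clean page
is treated as a sign that someone may still be writing to it.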