Subject: Re: [PATCHSET 0/3] Improve IOCB_NOWAIT O_DIRECT
From: Jens Axboe
To: Dave Chinner
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, hch@infradead.org,
 akpm@linux-foundation.org
Date: Mon, 8 Feb 2021 19:15:03 -0700
References: <20210208221829.17247-1-axboe@kernel.dk>
 <20210208232846.GO4626@dread.disaster.area>
 <44fec531-b2fd-f569-538a-64449a5c371b@kernel.dk>
 <20210209001445.GP4626@dread.disaster.area>
In-Reply-To: <20210209001445.GP4626@dread.disaster.area>

On 2/8/21 5:14 PM, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 04:37:26PM -0700, Jens Axboe wrote:
>> On 2/8/21 4:28 PM, Dave Chinner wrote:
>>> On Mon, Feb 08, 2021 at 03:18:26PM -0700, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Ran into an issue with IOCB_NOWAIT and O_DIRECT, which causes a rather
>>>> serious performance issue. If IOCB_NOWAIT is set, the generic/iomap
>>>> iterators check for page cache presence in the given range, and return
>>>> -EAGAIN if any is there. This is rather simplistic and looks like
>>>> something that was never really finished. For !IOCB_NOWAIT, we simply
>>>> call filemap_write_and_wait_range() to issue writeback (if any) and
>>>> wait on the range. The fact that we have page cache entries for this
>>>> range does not mean that we cannot safely do O_DIRECT IO to/from it.
>>>>
>>>> This series adds filemap_range_needs_writeback(), which checks if
>>>> we have pages in the range that do require us to call
>>>> filemap_write_and_wait_range(). If we don't, then we can proceed just
>>>> fine with IOCB_NOWAIT.
>>>
>>> Not exactly. If it is a write we are doing, we _must_ invalidate
>>> the page cache pages over the range of the DIO write to maintain
>>> some level of cache coherency between the DIO write and the page
>>> cache contents. i.e. the DIO write makes the page cache contents
>>> stale, so the page cache has to be invalidated before the DIO write
>>> is started, and again when it completes to toss away racing updates
>>> (mmap) while the DIO write was in flight...
>>>
>>> Page invalidation can block (page locks, waits on writeback, taking
>>> the mmap_sem to zap page tables, etc), and it can also fail because
>>> pages are dirty (e.g. writeback+invalidation racing with mmap).
>>>
>>> And if it fails because of dirty pages, then we fall back to buffered
>>> IO, which serialises readers and writers and will block.
>>
>> Right, not disagreeing with any of that.
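
[Aside: a minimal sketch of what the filemap_range_needs_writeback()
check described in the cover letter above might look like. This is
reconstructed from the description in this thread, not taken from the
actual patches, so the exact body and early-out checks are assumptions:]

static bool filemap_range_needs_writeback(struct address_space *mapping,
					  loff_t start_byte, loff_t end_byte)
{
	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
	pgoff_t max = end_byte >> PAGE_SHIFT;
	struct page *page;

	/* No dirty or writeback pages anywhere in the mapping? Done. */
	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
		return false;
	if (end_byte < start_byte)
		return false;

	/*
	 * Walk the range: any dirty, locked, or under-writeback page
	 * means the caller must take the blocking flush-and-wait path.
	 */
	rcu_read_lock();
	xas_for_each(&xas, page, max) {
		if (xas_retry(&xas, page))
			continue;
		if (xa_is_value(page))
			continue;
		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
			break;
	}
	rcu_read_unlock();
	/* page is NULL if the loop completed without finding a problem page */
	return page != NULL;
}

[The idea being that clean, unlocked cached pages are harmless to a DIO,
so only pages that actually need flushing force -EAGAIN under
IOCB_NOWAIT.]
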
>>
>>>> The problem manifested itself in a production environment, where
>>>> someone is doing O_DIRECT on a raw block device. Due to other
>>>> circumstances, blkid was triggered on this device periodically, and
>>>> blkid very helpfully does a number of page cache reads on the device.
>>>> Now the mapping has page cache entries, and performance falls to
>>>> pieces because we can no longer reliably do IOCB_NOWAIT O_DIRECT.
>>>
>>> If it was a DIO write, then the pages would have been invalidated
>>> on the first write and the second write would be issued with NOWAIT
>>> just fine.
>>>
>>> So the problem sounds to me like DIO reads from the block device are
>>> not invalidating the page cache over the read range, so they persist
>>> and prevent IOCB_NOWAIT IO from being submitted.
>>
>> That is exactly the case I ran into indeed.
>>
>>> Historically speaking, this is why XFS always used to invalidate the
>>> page cache for DIO - it didn't want to leave cached clean pages that
>>> would prevent future DIOs from being issued concurrently because
>>> coherency with the page cache caused performance issues. We
>>> optimised away this invalidation because the data in the page cache
>>> is still valid after a flush+DIO read, but it sounds to me like
>>> there are still corner cases where "always invalidate cached pages"
>>> is the right thing for DIO to be doing....
>>>
>>> Not sure what the best way to go here is - the patch isn't correct
>>> for NOWAIT DIO writes, but it looks necessary for reads. And I'm not
>>> sure that we want to go back to "invalidate everything all the time"
>>> either....
>>
>> We still do the invalidation for writes with this patch; nothing has
>> changed there. We just skip the filemap_write_and_wait_range() if
>> there's nothing to write. And if there's nothing to write, _hopefully_
>> the invalidation should go smoothly unless someone dirtied/locked/
>> put-under-writeback the page since we did the check. But that's always
>> going to be racy, and there's not a whole lot we can do about that.
>
> Sure, but if someone has actually mapped the range and is accessing
> it, then PTEs will need zapping and mmap_sem needs to be taken in
> write mode. If there's continual racing access, you've now got the
> mmap_sem regularly being taken exclusively in the IOCB_NOWAIT path,
> and that means it will get serialised against other threads in the
> task doing page faults and other mm context operations. The "needs
> writeback" check you've added does nothing to alleviate this
> potential blocking point for the write path.
>
> That's my point - you're exposing obvious new blocking points for
> IOCB_NOWAIT DIO writes, not removing them. It may not happen very
> often, but the whack-a-mole game you are playing with IOCB_NOWAIT is
> "we found an even rarer blocking condition that it is critical to
> our application". While this patch whacks this specific mole in the
> read path, it also exposes the write path to another rare blocking
> condition that will eventually end up being the mole that needs to
> be whacked...
>
> Perhaps the needs-writeback optimisation should only be applied to
> the DIO read path?

Sure, we can do that as a first version, and then tackle the remainder
of the write side separately, as we need to handle invalidating inode
pages separately too. I'll send out a v2 with just the read side.

-- 
Jens Axboe
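
[Aside: confined to the read side as agreed above, the v2 logic would
look roughly like the following. The helper name and exact placement in
the generic read path are assumptions for illustration, not the actual
v2 patch:]

/* Hypothetical helper capturing the read-side IOCB_NOWAIT logic. */
static int dio_read_flush_or_bail(struct kiocb *iocb, size_t count)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;

	if (iocb->ki_flags & IOCB_NOWAIT) {
		/*
		 * Clean cached pages no longer force -EAGAIN; only a
		 * range that genuinely needs flushing makes us bail.
		 */
		if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
						  iocb->ki_pos + count - 1))
			return -EAGAIN;
		return 0;
	}

	/* Blocking path: issue writeback and wait, as before. */
	return filemap_write_and_wait_range(mapping, iocb->ki_pos,
					    iocb->ki_pos + count - 1);
}

[The write side keeps its existing behaviour, including the page cache
invalidation Dave describes, which sidesteps the mmap_sem concern for
now.]
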