From: Ming Lei
To: Mike Snitzer
Cc: Andrew Morton, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, David Hildenbrand, Matthew Wilcox,
    Alexander Viro, Christian Brauner, Don Dutile, Rafael Aquini,
    Dave Chinner
Subject: Re: mm/madvise: set ra_pages as device max request size during ADV_POPULATE_READ
Date: Fri, 2 Feb 2024 18:52:22 +0800
References: <20240202022029.1903629-1-ming.lei@redhat.com>

On Thu, Feb 01, 2024 at 11:43:11PM -0500, Mike Snitzer wrote:
> On Thu, Feb 01 2024 at 9:20P -0500,
> Ming Lei wrote:
>
> > madvise(MADV_POPULATE_READ) tries to populate all page tables in the
> > specific range, so it is usually sequential IO if VMA is backed by
> > file.
> >
> > Set ra_pages as device max request size for the involved readahead in
> > the ADV_POPULATE_READ, this way reduces latency of madvise(MADV_POPULATE_READ)
> > to 1/10 when running madvise(MADV_POPULATE_READ) over one 1GB file with
> > usual(default) 128KB of read_ahead_kb.
> >
> > Cc: David Hildenbrand
> > Cc: Matthew Wilcox
> > Cc: Alexander Viro
> > Cc: Christian Brauner
> > Cc: Don Dutile
> > Cc: Rafael Aquini
> > Cc: Dave Chinner
> > Cc: Mike Snitzer
> > Cc: Andrew Morton
> > Signed-off-by: Ming Lei
> > ---
> >  mm/madvise.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 51 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..db5452c8abdd 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -900,6 +900,37 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> >  		return -EINVAL;
> >  }
> >
> > +static void madvise_restore_ra_win(struct file **file, unsigned int ra_pages)
> > +{
> > +	if (*file) {
> > +		struct file *f = *file;
> > +
> > +		f->f_ra.ra_pages = ra_pages;
> > +		fput(f);
> > +		*file = NULL;
> > +	}
> > +}
> > +
> > +static struct file *madvise_override_ra_win(struct file *f,
> > +		unsigned long start, unsigned long end,
> > +		unsigned int *old_ra_pages)
> > +{
> > +	unsigned int io_pages;
> > +
> > +	if (!f || !f->f_mapping || !f->f_mapping->host)
> > +		return NULL;
> > +
> > +	io_pages = inode_to_bdi(f->f_mapping->host)->io_pages;
> > +	if (((end - start) >> PAGE_SHIFT) < io_pages)
> > +		return NULL;
> > +
> > +	f = get_file(f);
> > +	*old_ra_pages = f->f_ra.ra_pages;
> > +	f->f_ra.ra_pages = io_pages;
> > +
> > +	return f;
> > +}
> > +
>
> Does this override imply
> that madvise_populate resorts to calling
> filemap_fault() and here you're just arming it to use the larger
> ->io_pages for the duration of all associated faulting?

Yes.

>
> Wouldn't it be better to avoid faulting and build up larger page

How can we avoid the fault handling? It is needed to build the VA->PA
mapping.

> vectors that get sent down to the block layer in one go and let the

filemap_fault() already tries to allocate folios in big sizes (the max
order is MAX_PAGECACHE_ORDER), see page_cache_ra_order() and
ra_alloc_folio().

> block layer split using the device's limits? (like happens with
> force_page_cache_ra)

Here the filemap code doesn't deal with the block layer directly, because
the VFS & FS sit in between and IO mapping is required. It just calls
aops->readahead() or aops->read_folio(), but block plug &
readahead_control are applied so that everything is handled in batch.

>
> I'm concerned that madvise_populate isn't so efficient with filemap

That is why this patch increases the readahead window, so that
madvise_populate() performance can be improved by 10x for a big
file-backed populate read.

> due to excessive faulting (*BUT* I haven't traced to know, I'm just
> inferring that is why twiddling f->f_ra.ra_pages helps improve
> madvise_populate by having it issue larger IO. Apologies if I'm way
> off base)

As mentioned, fault handling can't be avoided, but we can improve the
performance of the involved readahead IO.

Thanks,
Ming
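
For anyone who wants to check the window arithmetic discussed above, here
is a minimal userspace sketch of the range check done in the quoted
madvise_override_ra_win(). It is only a model, not kernel code: the helper
names (kb_to_pages, range_wants_override, nr_windows) are made up for
illustration, and the 4KB page size and the 1280KB device limit used in
the note below are assumptions, not values taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Pages covered by a window given in KB (e.g. read_ahead_kb). */
static unsigned long kb_to_pages(unsigned long kb, unsigned long page_size)
{
	return kb * 1024UL / page_size;
}

/*
 * Userspace model of the check in the quoted madvise_override_ra_win():
 * the override is skipped when the range covers fewer than io_pages pages.
 */
static bool range_wants_override(unsigned long start, unsigned long end,
				 unsigned int page_shift,
				 unsigned long io_pages)
{
	return ((end - start) >> page_shift) >= io_pages;
}

/* Number of windows needed to cover nr_pages, rounding up. */
static unsigned long nr_windows(unsigned long nr_pages,
				unsigned long win_pages)
{
	assert(win_pages != 0);
	return (nr_pages + win_pages - 1) / win_pages;
}
```

With 4KB pages, the default 128KB read_ahead_kb is 32 pages, so a 1GB
range (262144 pages) takes 8192 windows. With a hypothetical 1280KB device
limit (320 pages) the same range takes 820 windows. This only counts
windows and says nothing about measured latency.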