From: Yafang Shao
Date: Wed, 13 Nov 2024 17:54:15 +0800
Subject: Re: [PATCH v2] mm/readahead: Fix large folio support in async readahead
To: David Hildenbrand
Cc: willy@infradead.org, akpm@linux-foundation.org, linux-mm@kvack.org, stable@vger.kernel.org
References: <20241108141710.9721-1-laoar.shao@gmail.com> <85cfc467-320f-4388-b027-2cbad85dfbed@redhat.com> <88211032-80e1-4067-a74c-c9dcea1abff8@redhat.com>
In-Reply-To: <88211032-80e1-4067-a74c-c9dcea1abff8@redhat.com>

On Wed, Nov 13, 2024 at 4:28 PM David Hildenbrand wrote:
>
> On 13.11.24 03:16, Yafang Shao wrote:
> > On Tue, Nov 12, 2024 at 11:19 PM David Hildenbrand wrote:
> >>
> >>>> Sorry, but this code is getting quite confusing, especially with such
> >>>> misleading "large folio" comments.
> >>>>
> >>>> Even without MADV_HUGEPAGE we will be allocating large folios, as
> >>>> emphasized by Willy [1].
> >>>> So the only thing MADV_HUGEPAGE controls is
> >>>> *which* large folios we allocate. .. as Willy says [2]: "We were only
> >>>> intending to breach the 'max' for the MADV_HUGE case, not for all cases."
> >>>>
> >>>> I have no idea how *anybody* should derive from the code here that we
> >>>> treat MADV_HUGEPAGE in a special way.
> >>>>
> >>>> Simply completely confusing.
> >>>>
> >>>> My interpretation of "I don't know if we should try to defend a stupid
> >>>> sysadmin against the consequences of their misconfiguration like this"
> >>>> would be "drop this patch and don't change anything".
> >>>
> >>> Without this change, large folios won’t function as expected.
> >>> Currently, to support MADV_HUGEPAGE, you’d need to set readahead_kb to
> >>> 2MB, 4MB, or more. However, many applications run without
> >>> MADV_HUGEPAGE, and a larger readahead_kb might not be optimal for
> >>> them.
> >>
> >> Someone configured: "Don't readahead more than 128KiB"
> >>
> >> And then we complain why we "don't readahead more than 128KiB".
> >
> > That is just bikeshedding.
>
> It's called "reading the documentation and trying to make sense of a
> patch". ;)
>
> >
> > So, what’s your suggestion? Simply setting readahead_kb to 2MB? That
> > would almost certainly cause issues elsewhere.
>
> I'm not 100% sure. I'm trying to make sense of it all.
>
> And I assume there is a relevant difference now between "readahead 2M
> using all 4k pages" and "readahead 2M using a single large folio".
>
> I agree that likely readahead using many 4k pages is a worse idea than
> just using a single large folio ... if we manage to allocate one. And
> it's all not that clear in the code ...
>
> FWIW, I looked at "read_ahead_kb" values on my Fedora40 notebook and
> they are all set to 128KiB. I'm not so sure if they really should be
> that small ...

It depends on the use case. For our Hadoop servers, we set it to 4MB,
as they prioritize throughput over latency.
However, for our Kubernetes servers, we keep it at 128KB since those
services are more latency-sensitive, and increasing it could lead to
more frequent latency spikes.

> or if large folio readahead code should just be able to
> exceed it.
>
> >> "mm/filemap: Support VM_HUGEPAGE for file mappings" talks about "even if
> >> we have no history of readahead being successful".
> >>
> >> So not about exceeding the configured limit, but exceeding the
> >> "readahead history".
> >>
> >> So I consider VM_HUGEPAGE the sign here to "ignore readahead history"
> >> and not to "violate the config".
> >
> > MADV_HUGEPAGE is definitely a new addition to readahead, and its
> > behavior isn’t yet defined in the documentation. All we need to do is
> > clarify its behavior there. The documentation isn’t set in stone; we
> > can update it as long as it doesn’t disrupt existing applications.
>
> If Willy thinks this is the way to go, then we should document that
> MADV_HUGEPAGE may ignore the parameter, agreed.

I'll submit an additional patch to update the documentation for
MADV_HUGEPAGE.

>
> I still don't understand your one comment:
>
> "It's worth noting that if read_ahead_kb is set to a larger value that
> isn't aligned with huge page sizes (e.g., 4MB + 128KB), it may still
> fail to map to hugepages."
>
> Do you mean that MADV_HUGEPAGE+read_ahead_kb<=4M will give you 2M pages,
> but MADV_HUGEPAGE+read_ahead_kb>4M won't? Or is this the case without
> MADV_HUGEPAGE?

Typically, users set read_ahead_kb to aligned sizes, such as 128KB,
256KB, 512KB, 1MB, 2MB, 4MB, or 8MB. With this patch, MADV_HUGEPAGE
functions well for all these settings. However, if read_ahead_kb is
set to a non-hugepage-aligned size (e.g., 4MB + 128KB), MADV_HUGEPAGE
won’t work.
This is because the initial readahead size for MADV_HUGEPAGE
is set to 4MB, as established in commit 4687fdbb805a:

    ra->size = HPAGE_PMD_NR;
    if (!(vmf->vma->vm_flags & VM_RAND_READ))
        ra->size *= 2;

However, as Willy noted, non-aligned settings are quite stupid, so we
should disregard them.

>
> If MADV_HUGEPAGE ignores read_ahead_kb completely, it's easy to document.

Perhaps, but documenting the behavior of every unusual setting doesn’t
seem practical.

>
> >
> >>
> >> But that's just my opinion.
> >>
> >>>
> >>>>
> >>>> No changes to API, no confusing code.
> >>>
> >>> New features like large folios can often create confusion with
> >>> existing rules or APIs, correct?
> >>
> >> We should not try making it even more confusing, if possible.
> >
> > A quick tip for you: the readahead size already exceeds readahead_kb
> > even without MADV_HUGEPAGE. You might want to spend some time tracing
> > that behavior.
>
> Care to save me some time and point me at what you mean?

I reached this conclusion by tracing ra->size in each
page_cache_ra_order() call, but I’m not fully equipped to provide all
the details ;)

--
Regards
Yafang
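
For reference, the 4MB figure quoted above can be reproduced with a small
userspace sketch. It is not kernel code: the constants assume x86-64 with
4KiB base pages and 2MiB PMD-sized huge pages, and the names
SKETCH_PAGE_SIZE, SKETCH_HPAGE_PMD_NR, initial_ra_pages() and pages_to_kb()
are made up for illustration. It only mirrors the arithmetic of the quoted
snippet and says nothing about how the window is later adjusted against
read_ahead_kb.

```c
#include <stdbool.h>

/* Assumed values for x86-64 with 4 KiB base pages; other
 * architectures and page-size configurations differ. */
#define SKETCH_PAGE_SIZE    4096UL /* bytes per base page (assumption) */
#define SKETCH_HPAGE_PMD_NR 512UL  /* base pages per PMD-sized huge page (assumption) */

/* Initial readahead window in pages, mirroring the quoted snippet:
 * one PMD-sized huge page, doubled unless the mapping is marked
 * for random reads. */
static unsigned long initial_ra_pages(bool rand_read)
{
    unsigned long size = SKETCH_HPAGE_PMD_NR;

    if (!rand_read)
        size *= 2;
    return size;
}

/* Convert a page count to KiB, the unit used by read_ahead_kb. */
static unsigned long pages_to_kb(unsigned long pages)
{
    return pages * (SKETCH_PAGE_SIZE / 1024UL);
}
```

Under these assumptions, pages_to_kb(initial_ra_pages(false)) comes out to
4096 KiB, i.e. the 4MB mentioned above, and 2048 KiB when the random-read
flag is set.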