From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20240720173543.897972-1-jglisse@google.com> <0c390494-e6ba-4cde-aace-cd726f2409a1@redhat.com>
In-Reply-To: <0c390494-e6ba-4cde-aace-cd726f2409a1@redhat.com>
From: Jerome Glisse
Date: Tue, 23 Jul 2024 09:19:07 -0700
Subject: Re: [PATCH] mm: fix maxnode for mbind(), set_mempolicy() and migrate_pages()
To: David Hildenbrand
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Content-Type: multipart/alternative; boundary="0000000000002a4e5c061dec8686"

--0000000000002a4e5c061dec8686
Content-Type: text/plain; charset="UTF-8"

On Mon, 22 Jul 2024 at 06:09, David Hildenbrand wrote:
> On 20.07.24 19:35, Jerome Glisse wrote:
> > Because maxnode bug there is no way to bind or migrate_pages to the
> > last node in multi-node NUMA system unless you lie about maxnodes
> > when making the mbind, set_mempolicy or migrate_pages syscall.
> >
> > Manpage for those syscall describe maxnodes as the number of bits in
> > the node bitmap ("bit mask of nodes containing up to maxnode bits").
> > Thus if maxnode is n then we expect to have a n bit(s) bitmap which
> > means that the mask of valid bits is ((1 << n) - 1). The get_nodes()
> > decrement lead to the mask being ((1 << (n - 1)) - 1).
> >
> > The three syscalls use a common helper get_nodes() and first things
> > this helper do is decrement maxnode by 1 which leads to using n-1 bits
> > in the provided mask of nodes (see get_bitmap() an helper function to
> > get_nodes()).
> >
> > The lead to two bugs, either the last node in the bitmap provided will
> > not be use in either of the three syscalls, or the syscalls will error
> > out and return EINVAL if the only bit set in the bitmap was the last
> > bit in the mask of nodes (which is ignored because of the bug and an
> > empty mask of nodes is an invalid argument).
> >
> > I am surprised this bug was never caught ... it has been in the kernel
> > since forever.
>
> Let's look at QEMU: backends/hostmem.c
>
>      /*
>       * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
>       * as argument to mbind() due to an old Linux bug (feature?) which
>       * cuts off the last specified node. This means backend->host_nodes
>       * must have MAX_NODES+1 bits available.
>       */
>
> Which means that it's been known for a long time, and the workaround
> seems to be pretty easy.
>
> So I wonder if we rather want to update the documentation to match reality.
>

I think it is kind of weird to ask callers to supply maxnode+1 to work
around the bug. If we apply this patch, QEMU would continue to work as is,
and users who were not aware of the bug would be fixed as well. So I would
say applying this patch does more good. Long term, QEMU can drop its
workaround or keep it for backward compatibility with old kernels.

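To make the off-by-one concrete, the two masks from the patch description can be modelled roughly as below. This is only a sketch of the arithmetic, not kernel code: the function names are made up for illustration, and it assumes a caller that passes a single-word nodemask on a two-node system.

```python
# Rough model of the mask arithmetic from the patch description.
# All function names here are hypothetical; this is not kernel code.


def documented_mask(maxnode: int) -> int:
    """Valid-bit mask the man page implies: maxnode bits, ((1 << n) - 1)."""
    return (1 << maxnode) - 1 if maxnode > 0 else 0


def decremented_mask(maxnode: int) -> int:
    """Valid-bit mask once maxnode is decremented by 1: ((1 << (n - 1)) - 1)."""
    return (1 << (maxnode - 1)) - 1 if maxnode > 0 else 0


def selected_nodes(nodemask: int, valid_mask: int) -> list[int]:
    """Return the node ids still set once bits outside valid_mask are dropped."""
    kept = nodemask & valid_mask
    return [node for node in range(kept.bit_length()) if (kept >> node) & 1]


if __name__ == "__main__":
    last_node_only = 0b10  # two-node system, only node 1 requested

    # Decremented mask with maxnode=2, then with the maxnode+1 workaround.
    print(selected_nodes(last_node_only, decremented_mask(2)))
    print(selected_nodes(last_node_only, decremented_mask(3)))

    # Documented mask with maxnode=2, then with a caller still passing maxnode+1.
    print(selected_nodes(last_node_only, documented_mask(2)))
    print(selected_nodes(last_node_only, documented_mask(3)))
```

In this model the first call drops node 1 and leaves an empty set, which corresponds to the EINVAL case. The other three calls all select node 1, so a caller that already passes maxnode+1 gets the same result with either mask, as long as the extra bit is clear.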
--0000000000002a4e5c061dec8686--