From: David Hildenbrand
Organization: Red Hat
Date: Wed, 27 Oct 2021 09:15:35 +0200
Subject: Re: [PATCH v1] mm, pagemap: expose hwpoison entry
To: Peter Xu, Naoya Horiguchi
Cc: Dave Hansen, linux-mm@kvack.org, Andrew Morton, Alistair Popple,
 Mike Kravetz, Konstantin Khlebnikov, Bin Wang, Yang Shi,
 Naoya Horiguchi, linux-kernel@vger.kernel.org
References: <20211004115001.1544259-1-naoya.horiguchi@linux.dev>
 <258d0ddb-6c82-0c95-a15e-b085b59d2142@redhat.com>
 <20211004143228.GA1545442@u2004>
 <20211026232736.GA2704541@u2004>
 <20211027064513.GA2717516@u2004>

On 27.10.21 09:02, Peter Xu wrote:
> On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote:
>> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote:
>>> On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote:
>>>> On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote:
>>>>> On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote:
>>>>>> On 04.10.21 13:50, Naoya Horiguchi wrote:
>>>> ...
>>>>>>>
>>>>>>> Hwpoison entry for hugepage is also exposed by this patch. The below
>>>>>>> example shows how pagemap is visible in the case where a memory error
>>>>>>> hit a hugepage mapped to a process.
>>>>>>>
>>>>>>>     $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400
>>>>>>>     voffset    offset  len  flags
>>>>>>>     700000000  12fa00  1    ___U_______Ma__H_G_________________f_______1
>>>>>>>     700000001  12fa01  1ff  ___________Ma___TG_________________f_______1
>>>>>>>     700000200  12f800  1    __________B________X_______________f______w_
>>>>>>>     700000201  12f801  1    ___________________X_______________f______w_   // memory failure hit this page
>>>>>>>     700000202  12f802  1fe  __________B________X_______________f______w_
>>>>>>>
>>>>>>> The entries with both of "X" flag (hwpoison flag) and "w" flag (swap
>>>>>>> flag) are considered as hwpoison entries. So all pages in 2MB range
>>>>>>> are inaccessible from the process. We can get actual error location
>>>>>>> by page-types in physical address mode.
>>>>>>>
>>>>>>>     $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list
>>>>>>>     offset  len  flags
>>>>>>>     12f800  1    __________B_________________________________
>>>>>>>     12f801  1    ___________________X________________________
>>>>>>>     12f802  1fe  __________B_________________________________
>>>>>>>
>>>>>>> Signed-off-by: Naoya Horiguchi
>>>>>>> ---
>>>>>>>  fs/proc/task_mmu.c      | 41 ++++++++++++++++++++++++++++++++---------
>>>>>>>  include/linux/swapops.h | 13 +++++++++++++
>>>>>>>  tools/vm/page-types.c   |  7 ++++++-
>>>>>>>  3 files changed, 51 insertions(+), 10 deletions(-)
>>>>>>
>>>>>>
>>>>>> Please also update the documentation located at
>>>>>>
>>>>>> Documentation/admin-guide/mm/pagemap.rst
>>>>>
>>>>> I will do this in the next post.
>>>>
>>>> Reading the document, I found that swap type is already exported so we
>>>> could identify hwpoison entry with it (without new PM_HWPOISON bit).
>>>> One problem is that the format of swap types (like SWP_HWPOISON) depends
>>>> on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION,
>>>> so we also need to export how the swap type field is interpreted.
>>>
>>> I had similar question before.. though it was more on the generic swap entries
>>> not the special ones yet.
>>>
>>> The thing is I don't know how the userspace could interpret normal swap device
>>> indexes out of reading pagemap, say if we have two swap devices with "swapon
>>> -s" then I've no idea how do we know which device has which swap type index
>>> allocated. That seems to be a similar question asked above on special swap
>>> types - the interface seems to be incomplete, if not unused at all.
>>>
>>> AFAIU the information on "this page is swapped out to device X on offset Y" is
>>> not reliable too, because the pagein/pageout from kernel is transparent to the
>>> userspace and not under control of userspace at all. IOW, if the user reads
>>> that swap entry, then reads data upon the disk of that offset out and put it
>>> somewhere else, then it means the data read could already be old if kernel
>>> paged in the page after userspace reading the pagemap but before it reading the
>>> disk, and I don't see any way to make it right unless the userspace could stop
>>> the kernel from page-in a swap entry. That's why I really wonder whether we
>>> should expose normal swap entry at all, as I don't know how it could be helpful
>>> and used in the 100% right way.
>>
>> Thank you for the feedback.
>>
>> I think that a process interested in controlling swap-in/out behavior in its own
>> typically calls mincore() to get current status and madvise() to trigger swap-in/out.
>> That's not 100% solution for the same reason, but it mostly works well because
>> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although
>> some CPU/memory resource is wasted, but the amount of the waste is small if the
>> returned info is new enough).
>> So my point is that the concern around information newness might be more generic
>> issue rather than just for pagemap. If we need 100% accurate in-kernel info,
>> maybe it had better be done in kernel (or some cooler stuff like eBPF)?
>
> I fully agree the solution you mentioned with mincore() and madvise(), that is
> very sane and working approach. Though IMHO the major thing I wanted to point
> out is for generic swap devices we exposed (disk_index, disk_offset) tuple as
> the swap entry (besides "whether this page is swapped out or not"; that's
> PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it
> right for shmem), though to use it we need to either record the index/offset or
> read/write data from it. However none of them will make sense, IMHO.. So I
> think exposing PM_SWAP makes sense, not the swap entries on swap devices.
>
>>
>>>
>>> Special swap entries seem a bit different - at least for is_pfn_swap_entry()
>>> typed swap entries we can still expose the PFN which might be helpful, which I
>>> can't tell.
>>
>> I'm one who think it helpful for testing, although I know testing might not be
>> considered as a real usecase.
>
> I think testing is valid use case too.
>
>>
>>>
>>> I used to send an email to Matt Mackall and Dave Hansen
>>> asking about above but didn't get a reply. Ccing
>>> again this time with the list copied.
>>>
>>>>
>>>> I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/,
>>>> which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE}
>>>> is enabled):
>>>>
>>>>     $ ls /sys/kernel/mm/swap/type_format/
>>>>     hwpoison
>>>>     migration_read
>>>>     migration_write
>>>>     device_write
>>>>     device_read
>>>>     device_exclusive_write
>>>>     device_exclusive_read
>>>>
>>>>     $ cat /sys/kernel/mm/swap/type_format/hwpoison
>>>>     25
>>>>
>>>>     $ cat /sys/kernel/mm/swap/type_format/device_write
>>>>     28
>>>>
>>>> Does it make sense or any better approach?
>>>
>>> Then I'm wondering whether we care about the rest of the normal swap devices
>>> too with pagemap so do we need to expose some information there too (only if
>>> there's a real use case, though..)? Or... should we just don't expose swap
>>> entries at all, at least generic swap entries? We can still expose things like
>>> hwpoison via PM_* bits well defined in that case.
>>
>> I didn't think about normal swap devices for no reason. I'm OK to stop exposing
>> normal swap device part. I don't have strong opinion yet about which approach
>> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback).
>
> No strong opinion here too. It's just that the new interface proposed reminded
> me that it's partially complete if considering we're also exposing swap entries
> on swap devices, so the types didn't cover those entries. However it's more
> like a pure question because I never figured out how those entries will work
> anyway. I'd be willing to know whether Dave Hansen would comment on this.
>
> While the PM_HWPOISON approach looks always sane to me.

I consider that somehow cleaner, because how HWPOISON entries are
implemented ("fake swap entries") is somewhat an internal implementation
detail.

(I also agree that PM_SWAP makes sense, but maybe really only when we're
actually dealing with something that has been/is currently being swapped
out.
Maybe we should just not expose fake swap entries via PM_SWAP and
instead use proper PM_ types for that. PM_MIGRATION, PM_HWPOISON, ...)

-- 
Thanks,

David / dhildenb
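
For reference on the PM_SWAP versus swap-type question discussed in this
thread, here is a minimal userspace sketch of decoding one 64-bit
/proc/<pid>/pagemap entry, following the bit layout described in
Documentation/admin-guide/mm/pagemap.rst. The helper names
decode_pagemap_entry() and read_pagemap_entry() are made up for
illustration. The sketch assumes the documented layout (bit 63 present,
bit 62 swapped, bits 0-4 swap type and bits 5-54 swap offset when swapped,
bits 0-54 PFN when present). It does not know what a given swap type value
means, which is exactly the gap raised above.

```python
import os
import struct

# Flag bits as documented in Documentation/admin-guide/mm/pagemap.rst.
PM_PRESENT = 1 << 63
PM_SWAP = 1 << 62
PM_FILE = 1 << 61            # file-mapped page or shared anonymous page
PM_MMAP_EXCLUSIVE = 1 << 56
PM_SOFT_DIRTY = 1 << 55
PM_PFRAME_MASK = (1 << 55) - 1   # bits 0-54
SWAP_TYPE_BITS = 5               # bits 0-4 carry the swap type when PM_SWAP is set


def decode_pagemap_entry(entry):
    """Split a raw 64-bit pagemap entry into its documented fields."""
    info = {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAP),
        "file_or_shared_anon": bool(entry & PM_FILE),
        "exclusive": bool(entry & PM_MMAP_EXCLUSIVE),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
    }
    low = entry & PM_PFRAME_MASK
    if info["swapped"]:
        # The meaning of swap_type is kernel-config dependent and is not
        # exported to userspace; only the raw number is available here.
        info["swap_type"] = low & ((1 << SWAP_TYPE_BITS) - 1)
        info["swap_offset"] = low >> SWAP_TYPE_BITS
    elif info["present"]:
        # Reads as 0 for callers without CAP_SYS_ADMIN.
        info["pfn"] = low
    return info


def read_pagemap_entry(pid, vaddr):
    """Read the pagemap entry covering virtual address vaddr of process pid."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)   # one 8-byte entry per virtual page
        return struct.unpack("=Q", f.read(8))[0]
```

With the illustrative value from the type_format example above (hwpoison
reported as 25, which is only an example and not a stable ABI), a tool
would have to compare info["swap_type"] against a number it obtained from
somewhere else, whereas a dedicated PM_HWPOISON-style flag would need no
such lookup.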