From: Shakeel Butt
Date: Tue, 22 Sep 2020 06:37:02 -0700
Subject: Re: Machine lockups on extreme memory pressure
To: Michal Hocko
Cc: Johannes Weiner, Linux MM, Andrew Morton, Roman Gushchin, LKML, Greg Thelen
In-Reply-To: <20200922111202.GY12990@dhcp22.suse.cz>

On Tue, Sep 22, 2020 at 4:12 AM Michal Hocko wrote:
>
> On Mon 21-09-20 11:35:35, Shakeel Butt wrote:
> > Hi all,
> >
> > We are seeing machine lockups due to extreme memory pressure where the
> > free pages on all the zones are way below the min watermarks. The stack
> > of the stuck CPU looks like the following (I had to crash the machine to
> > get the info).
>
> sysrq+l didn't report anything?
>

Sorry, I misspoke when I said I crashed the machine myself; I got the
state of the machine from the crash dump. We have a crash timer on our
machines which needs to be reset from user space every couple of hours.
If the user space daemon responsible for resetting it does not get a
chance to run, the machine is crashed, so these crashes are cases where
that daemon could not run for a couple of hours.

> > #0  [ ] crash_nmi_callback
> > #1  [ ] nmi_handle
> > #2  [ ] default_do_nmi
> > #3  [ ] do_nmi
> > #4  [ ] end_repeat_nmi
> > --- ---
> > #5  [ ] queued_spin_lock_slowpath
> > #6  [ ] _raw_spin_lock
> > #7  [ ] ____cache_alloc_node
> > #8  [ ] fallback_alloc
> > #9  [ ] __kmalloc_node_track_caller
> > #10 [ ] __alloc_skb
> > #11 [ ] tcp_send_ack
> > #12 [ ] tcp_delack_timer
> > #13 [ ] run_timer_softirq
> > #14 [ ] irq_exit
> > #15 [ ] smp_apic_timer_interrupt
> > #16 [ ] apic_timer_interrupt
> > --- ---
> > #17 [ ] apic_timer_interrupt
> > #18 [ ] _raw_spin_lock
> > #19 [ ] vmpressure
> > #20 [ ] shrink_node
> > #21 [ ] do_try_to_free_pages
> > #22 [ ] try_to_free_pages
> > #23 [ ] __alloc_pages_direct_reclaim
> > #24 [ ] __alloc_pages_nodemask
> > #25 [ ] cache_grow_begin
> > #26 [ ] fallback_alloc
> > #27 [ ] __kmalloc_node_track_caller
> > #28 [ ] __alloc_skb
> > #29 [ ] tcp_sendmsg_locked
> > #30 [ ] tcp_sendmsg
> > #31 [ ] inet6_sendmsg
> > #32 [ ] ___sys_sendmsg
> > #33 [ ] sys_sendmsg
> > #34 [ ] do_syscall_64
> >
> > These are high traffic machines. Almost all the CPUs are stuck on the
> > root memcg's vmpressure sr_lock and almost half of the CPUs are stuck
> > on the kmem cache node's list_lock in the IRQ.
>
> Are you able to track down the lock holder?
>
> > Note that the vmpressure sr_lock is irq-unsafe.
>
> Which is ok because this is only triggered from the memory reclaim and
> that cannot ever happen from the interrupt context for obvious reasons.
>
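
Right, vmpressure() is only called from the reclaim path in process
context. The problem is that sr_lock is taken with a plain spin_lock(),
so the CPU holding (or spinning for) it still takes interrupts and, as
in the trace above, ends up spinning on the slab node's list_lock from
the delack timer path while everyone else waits on sr_lock. A simplified
sketch of the pattern, with stand-in lock names and hypothetical helpers
(reclaim_path(), ack_timer_path()) rather than the actual mm/vmpressure.c
or slab code:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(sr_lock);    /* stand-in for the vmpressure sr_lock   */
static DEFINE_SPINLOCK(list_lock);  /* stand-in for the kmem cache list_lock */

/* Process context: direct reclaim -> vmpressure() */
static void reclaim_path(void)
{
        spin_lock(&sr_lock);        /* irq-unsafe: interrupts stay enabled */
        /*
         * The timer interrupt can fire here; its softirq work
         * (tcp_delack_timer -> __alloc_skb) spins on list_lock
         * while sr_lock is still held, so every other CPU
         * waiting on sr_lock is stalled as well.
         */
        spin_unlock(&sr_lock);
}

/* Softirq context: delayed-ACK timer -> skb allocation */
static void ack_timer_path(void)
{
        spin_lock(&list_lock);      /* contended by the queued ACKs */
        /* ... allocate and send the ACK ... */
        spin_unlock(&list_lock);
}

/*
 * An irq-safe variant keeps the holder from being interrupted, at the
 * cost of disabling interrupts around the critical section:
 */
static void reclaim_path_irqsafe(void)
{
        unsigned long flags;

        spin_lock_irqsave(&sr_lock, flags);
        /* the timer interrupt is held off until we drop the lock */
        spin_unlock_irqrestore(&sr_lock, flags);
}
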
> > Couple of months back, we observed a similar situation with swap locks,
> > which forced us to disable swap under global pressure. Since we do
> > proactive reclaim, disabling swap for global reclaim was not an issue.
> > However, now we have started seeing the same situation with other
> > irq-unsafe locks like the vmpressure sr_lock, and almost all the slab
> > shrinkers have irq-unsafe spinlocks. One way to mitigate this is to
> > convert all such locks (which can be taken in the reclaim path) to be
> > irq-safe, but that does not seem like a maintainable solution.
>
> This doesn't make much sense to be honest. We are not disabling IRQs
> unless it is absolutely necessary.
>
> > Please note that we are running a user space oom-killer which is more
> > aggressive than oomd/PSI, but even that got stuck under this much
> > memory pressure.
> >
> > I am wondering if anyone else has seen a similar situation in
> > production and if there is a recommended way to resolve it.
>
> I would recommend focusing on tracking down who is blocking further
> progress.

I was able to find the CPU next in line for the list_lock from the dump.
I don't think anyone is blocking progress as such; rather, the spinlock
taken in irq context is starving the spinlock taken in process context.
This is a high traffic machine and there are tens of thousands of
potential network ACKs on the queue.

I talked about this problem with Johannes at LPC 2019 and I think we
discussed two potential solutions: first, somehow giving memory reserves
to oomd, and second, an in-kernel PSI-based oom-killer. I am not sure the
first one will work in this situation, but the second one might help.

Shakeel