* user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-07 20:29 UTC
To: linux-mm; +Cc: Michal Hocko

Hi,

This is in response to:
https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m8b25fd42501d780d8053fc7aa9f4e3a28a19c49f

I decided to open a bug report for tracking and attachments but I'm
also subscribed now to this list so - either here or there.

"loss of responsiveness during heavy swap"
https://bugzilla.kernel.org/show_bug.cgi?id=206117

Thanks!

--
Chris Murphy
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-07 20:58 UTC
To: Chris Murphy; +Cc: linux-mm

On Tue 07-01-20 13:29:20, Chris Murphy wrote:
> Hi,
>
> This is in response to:
> https://lore.kernel.org/linux-fsdevel/20200104090955.GF23195@dread.disaster.area/T/#m8b25fd42501d780d8053fc7aa9f4e3a28a19c49f
>
> I decided to open a bug report for tracking and attachments but I'm
> also subscribed now to this list so - either here or there.
>
> "loss of responsiveness during heavy swap"
> https://bugzilla.kernel.org/show_bug.cgi?id=206117

Please collect more snapshots of /proc/vmstat (e.g. in 1s intervals) from
the moment you start your workload. This will give us insight into how the
memory characteristics change over time. From the single snapshot we can
see

  nr_zone_inactive_anon 247598
  nr_zone_active_anon 1551129
  nr_zone_inactive_file 25283
  nr_zone_active_file 21563
  pswpin 6002574
  pswpout 11199783

which tells us that there is not all that much page cache resident (less
than 200MB) while the majority of the memory is anonymous (7GB), most of
it tracked as active.

Btw. from a quick look at the sysrq output there seem to be quite a lot of
tasks (more than 1k) running on the system. Only a handful of them belong
to the compilation. kswapd is busy and 13 processes are in direct reclaim,
all swapping out to the disk.

From the above, my first guess would be that you are oversubscribing the
memory you have available. I would focus on who is consuming all that
memory.

--
Michal Hocko
SUSE Labs
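[Note: one simple way to collect such snapshots from a shell is sketched
below; this loop is not from the thread and the log file name is arbitrary.]

  # Append a timestamped copy of /proc/vmstat roughly once per second.
  # Stop with Ctrl-C. Under heavy memory pressure some iterations may be
  # delayed or skipped; the read_vmstat.c tool posted later in this thread
  # is more robust because it mlocks its buffers.
  while sleep 1; do
      echo "=== $(date +%s) ==="
      cat /proc/vmstat
  done > vmstat.log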
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-07 21:25 UTC
To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> Please collect more snapshots of /proc/vmstat (e.g. in 1s intervals)

OK.

> Btw. from a quick look at the sysrq output there seem to be quite a lot
> of tasks (more than 1k) running on the system. Only a handful of them
> belong to the compilation. kswapd is busy and 13 processes are in direct
> reclaim, all swapping out to the disk.

There might be many dozens of tabs in Firefox with nothing loaded in them;
I'm trying to keep the testing more real world (a compile while browsing)
rather than being too deferential to the compile. That does clutter the
sysrq+t output, but it doesn't change the outcome. The central culprit is
the ninja compile, which by default runs n+2 jobs where n is the number of
virtual CPUs.

> From the above, my first guess would be that you are oversubscribing the
> memory you have available. I would focus on who is consuming all that
> memory.

ninja - I have made the argument that it is in some sense sabotaging the
system, and I think they're trying to do something a little smarter with
their defaults; however, it's an unprivileged task acting as a kind of
fork bomb that takes down the system. It's a really eyebrow-raising and
remarkable experience. And it's common within the somewhat vertical use
case of developers compiling things on their own systems. Many IDEs use a
ton of resources, as much as they can get. It's not clear to me by what
mechanism either the user or these processes are supposed to effectively
negotiate for limited resources, other than resource restriction. But
anyway, these aren't contrived or malicious examples. A much more
synthetic example is 'tail /dev/zero', which is arrested much more quickly
by the kernel oom-killer, at least on recent kernels.

--
Chris Murphy
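[Note: ninja's default parallelism can be capped explicitly from the
command line; the numbers below are arbitrary examples, not values
suggested in the thread.]

  # Cap ninja at 4 parallel jobs, and additionally stop spawning new jobs
  # while the system load average is above 8.
  ninja -j4 -l 8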
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-08 9:25 UTC
To: Chris Murphy; +Cc: linux-mm

On Tue 07-01-20 14:25:46, Chris Murphy wrote:
> On Tue, Jan 7, 2020 at 1:58 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Btw. from a quick look at the sysrq output there seem to be quite a lot
> > of tasks (more than 1k) running on the system. Only a handful of them
> > belong to the compilation. kswapd is busy and 13 processes are in
> > direct reclaim, all swapping out to the disk.
>
> There might be many dozens of tabs in Firefox with nothing loaded in
> them; I'm trying to keep the testing more real world (a compile while
> browsing) rather than being too deferential to the compile. That does
> clutter the sysrq+t output, but it doesn't change the outcome. The
> central culprit is the ninja compile, which by default runs n+2 jobs
> where n is the number of virtual CPUs.

How much memory does the compile process eat?

> > From the above, my first guess would be that you are oversubscribing
> > the memory you have available. I would focus on who is consuming all
> > that memory.
>
> ninja - I have made the argument that it is in some sense sabotaging
> the system, and I think they're trying to do something a little
> smarter with their defaults; however, it's an unprivileged task acting
> as a kind of fork bomb that takes down the system.

Well, I am not sure the fork bomb analogy is appropriate. Only a dozen
compile processes were captured, so unless there are far more of them in
other phases, this is really negligible compared to the rest of the
workload running on the system.

> It's a really
> eyebrow-raising and remarkable experience. And it's common within the
> somewhat vertical use case of developers compiling things on their own
> systems. Many IDEs use a ton of resources, as much as they can get.
> It's not clear to me by what mechanism either the user or these
> processes are supposed to effectively negotiate for limited resources,
> other than resource restriction. But anyway, these aren't contrived or
> malicious examples.

If you know that the compilation process is too disruptive wrt. memory/cpu
consumption then you can use cgroups (memory and cpu controllers) to
throttle that consumption and protect the rest of the system. The
compilation will take much more time of course, and the explicit
configuration is obviously less comfortable than out-of-the-box auto
configuration, but the kernel simply doesn't have the information to
prioritize resources.

I do agree that the oom detection could be improved to detect heavy
thrashing - be it on the page cache or on swapin/out - and kill something
rather than leave the system struggling in a highly unproductive state.
This is far from trivial because what counts as productive is not
something the kernel can tell easily; it depends on the workload. As
mentioned elsewhere, userspace is likely much better suited to define that
policy, and PSI seems to be a good indicator.

> A much more synthetic example is 'tail /dev/zero',
> which is arrested much more quickly by the kernel oom-killer, at
> least on recent kernels.

Yeah, same as any other memory leak: the memory will simply run out at
some point and the OOM killer can detect that with good confidence. It is
the thrashing (the working set not fitting into memory and refaulting like
crazy) that the kernel struggles - and loses - to handle.

--
Michal Hocko
SUSE Labs
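[Note: one possible way to apply the cgroup throttling suggested above is a
transient systemd scope. This assumes systemd with the cgroup v2 memory and
cpu controllers available to the user session; the unit limits shown are
placeholders, not tuned recommendations.]

  # Run the build under memory and CPU caps so reclaim pressure stays
  # mostly inside the scope instead of stalling the whole system.
  systemd-run --user --scope \
      -p MemoryHigh=3G -p MemoryMax=4G -p CPUQuota=400% \
      ninja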
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-08 21:14 UTC
To: Michal Hocko; +Cc: linux-mm

On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> How much memory does the compile process eat?

By default it sets jobs to numcpus+2, which is 10 here. But each job
variably has two processes, and each process's memory requirement varies a
ton, from a few MB to over 1GB. In the first 20 minutes, about 13000
processes have started and stopped.

I've updated the bug, attaching kernel messages and /proc/vmstat in 1s
increments, although quite often during the build multiple seconds of
sampling were just skipped as the system was under too much pressure.

> If you know that the compilation process is too disruptive wrt.
> memory/cpu consumption then you can use cgroups (memory and cpu
> controllers) to throttle that consumption and protect the rest of the
> system. The compilation will take much more time of course, and the
> explicit configuration is obviously less comfortable than out-of-the-box
> auto configuration, but the kernel simply doesn't have the information
> to prioritize resources.

Yes, but this isn't scalable for regular users who just follow an
upstream's build instructions.

> I do agree that the oom detection could be improved to detect heavy
> thrashing - be it on the page cache or on swapin/out - and kill
> something rather than leave the system struggling in a highly
> unproductive state. This is far from trivial because what counts as
> productive is not something the kernel can tell easily; it depends on
> the workload. As mentioned elsewhere, userspace is likely much better
> suited to define that policy, and PSI seems to be a good indicator.

And even user space doesn't know in advance what resources are required.
The user can guess that this has been estimated incorrectly, force power
off, and start over by passing a lower number of jobs or whatever.

As for PSI, from the oomd folks it sounds like swap is a requirement. And
yet, because of the poor performance of swapping, quite a lot of users
don't have any swap. It's also hit-or-miss whether server environments
have swap, and swap is rare in cloud environments. So if there's a hard
requirement on swap existing, PSI isn't a universal solution.

Thanks,

--
Chris Murphy
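[Note: on kernels with PSI enabled (CONFIG_PSI; some distributions also
require psi=1 on the kernel command line), the memory pressure numbers that
tools like oomd consume can be read directly. The values shown below are
illustrative only.]

  $ cat /proc/pressure/memory
  some avg10=12.04 avg60=8.30 avg300=3.21 total=123456789
  full avg10=5.11 avg60=2.76 avg300=0.98 total=45678901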
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-08 21:18 UTC
To: Chris Murphy; +Cc: Michal Hocko, linux-mm

On Wed, Jan 8, 2020 at 2:14 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> As for PSI, from the oomd folks it sounds like swap is a requirement.

https://github.com/facebookincubator/oomd/issues/80

--
Chris Murphy
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-09 11:51 UTC
To: Chris Murphy; +Cc: linux-mm

On Wed 08-01-20 14:14:22, Chris Murphy wrote:
> On Wed, Jan 8, 2020 at 2:25 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > How much memory does the compile process eat?
>
> By default it sets jobs to numcpus+2, which is 10 here. But each job
> variably has two processes, and each process's memory requirement varies
> a ton, from a few MB to over 1GB. In the first 20 minutes, about 13000
> processes have started and stopped.

Well, the question is whether the memory demand comes from the parallelism
(aka something controlled by the build process) or from the compilation
itself, which might be really memory hungry and hard to accommodate even
with the parallelism disabled. If the latter is the case then there is not
really much the kernel can do, I am afraid. If the OOM killer kills your
compilation or another important part of your environment then you have
just lost work without any gain, right? You simply need more memory to
handle that workload, or you throttle the compilation and give it much
more time to finish, because it will be swapping in and out as the working
set will not fit into the restricted amount of memory.

> I've updated the bug, attaching kernel messages and /proc/vmstat in 1s
> increments, although quite often during the build multiple seconds of
> sampling were just skipped as the system was under too much pressure.

I have a tool to reduce that problem (see attached).

> > If you know that the compilation process is too disruptive wrt.
> > memory/cpu consumption then you can use cgroups (memory and cpu
> > controllers) to throttle that consumption and protect the rest of the
> > system. [...]
>
> Yes, but this isn't scalable for regular users who just follow an
> upstream's build instructions.

It certainly requires additional steps, no question about that. But I fail
to see how to do that automagically without knowing what the user expects
to happen in the resource shortage.

> > I do agree that the oom detection could be improved to detect heavy
> > thrashing - be it on the page cache or on swapin/out - and kill
> > something rather than leave the system struggling in a highly
> > unproductive state. [...]
>
> And even user space doesn't know in advance what resources are required.
> The user can guess that this has been estimated incorrectly, force power
> off, and start over by passing a lower number of jobs or whatever.
>
> As for PSI, from the oomd folks it sounds like swap is a requirement. And
> yet, because of the poor performance of swapping, quite a lot of users
> don't have any swap. It's also hit-or-miss whether server environments
> have swap, and swap is rare in cloud environments. So if there's a hard
> requirement on swap existing, PSI isn't a universal solution.

I cannot really comment on the swap requirement, but I would recommend
having swap space, especially for workloads whose peak memory consumption
doesn't fit into memory. Additional configuration steps might be needed so
that the whole system doesn't thrash on the swap (e.g. use memcgs with
proper main memory partitioning), but once you are overcommitting memory
you have to be careful, I believe.

--
Michal Hocko
SUSE Labs
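[Note: for a system without any swap, a minimal sketch of adding a swap
file follows; the path and size are placeholders, and copy-on-write
filesystems such as btrfs need extra steps before a file can be used as
swap.]

  # Create, protect, format and enable an 8G swap file.
  fallocate -l 8G /swapfile
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  # To make it permanent, add a line like this to /etc/fstab:
  #   /swapfile none swap defaults 0 0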
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-09 11:53 UTC
To: Chris Murphy; +Cc: linux-mm

[-- Attachment #1: Type: text/plain --]

On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > I've updated the bug, attaching kernel messages and /proc/vmstat in 1s
> > increments, although quite often during the build multiple seconds of
> > sampling were just skipped as the system was under too much pressure.
>
> I have a tool to reduce that problem (see attached).

Now for real

--
Michal Hocko
SUSE Labs

[-- Attachment #2: read_vmstat.c (Type: text/x-csrc) --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <time.h>

/*
 * A simple /proc/vmstat collector into a file. It tries hard to guarantee
 * that the content will get into the output file even under strong memory
 * pressure.
 *
 * Usage
 *  ./read_vmstat output_file timeout output_size
 *
 * output_file - either a non-existing file, or - for stdout
 * timeout     - time period between two snapshots. s - seconds,
 *               ms - milliseconds and m - minutes suffixes are allowed
 * output_size - size of the output file. The file is preallocated and
 *               pre-filled.
 *
 * If the output reaches the end of the file it will start over, overwriting
 * the oldest data. Each snapshot is enclosed by a header and a footer:
 *  =S timestamp
 *  [...]
 *  E=
 *
 * Please note that your ulimit has to be sufficient to allow mlocking of
 * the code + all the buffers.
 *
 * This comes under GPL v2
 * Copyright: Michal Hocko <mhocko@suse.cz> 2015
 */

#define NS_PER_MS (1000*1000)
#define NS_PER_SEC (1000*NS_PER_MS)

int open_file(const char *str)
{
    int fd;

    fd = open(str, O_CREAT|O_EXCL|O_RDWR, 0755);
    if (fd == -1) {
        perror("open output file");
        return -1;    /* caller treats -1 as failure */
    }

    return fd;
}

int read_timeout(const char *str, struct timespec *timeout)
{
    char *end;
    unsigned long val;

    val = strtoul(str, &end, 10);
    if (val == ULONG_MAX) {
        perror("Invalid number");
        return 1;
    }

    switch (*end) {
    case 's':
        timeout->tv_sec = val;
        break;
    case 'm': /* ms vs minute */
        if (*(end+1) == 's') {
            timeout->tv_sec = (val * NS_PER_MS) / NS_PER_SEC;
            timeout->tv_nsec = (val * NS_PER_MS) % NS_PER_SEC;
        } else {
            timeout->tv_sec = val*60;
        }
        break;
    default:
        fprintf(stderr, "Unknown number %s\n", str);
        return 1;
    }

    return 0;
}

size_t read_size(const char *str)
{
    char *end;
    size_t val = strtoul(str, &end, 10);

    switch (*end) {
    case 'K':
        val <<= 10;
        break;
    case 'M':
        val <<= 20;
        break;
    case 'G':
        val <<= 30;
        break;
    }

    return val;
}

/* copy one snapshot into the ring buffer, wrapping around at the end */
size_t dump_str(char *buffer, size_t buffer_size, size_t pos,
        const char *in, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++) {
        buffer[pos] = in[i];
        pos = (pos + 1) % buffer_size;
    }

    return pos;
}

/* buffer == NULL -> stdout */
int __collect_logs(const struct timespec *timeout, char *buffer,
        size_t buffer_size)
{
    char buff[4096]; /* dumped to the file automatically */
    time_t before, after;
    int in_fd = open("/proc/vmstat", O_RDONLY);
    size_t out_pos = 0;
    size_t size = 0;
    int estimate = 0;

    if (in_fd == -1) {
        perror("open vmstat:");
        return 1;
    }

    /* lock everything in */
    if (mlockall(MCL_CURRENT) == -1) {
        perror("mlockall. Continuing anyway");
    }

    while (1) {
        before = time(NULL);
        size = snprintf(buff, sizeof(buff), "=S %lu\n", before);
        lseek(in_fd, 0, SEEK_SET);
        size += read(in_fd, buff + size, sizeof(buff) - size);
        size += snprintf(buff + size, sizeof(buff) - size, "E=\n");

        if (buffer && !estimate) {
            printf("Estimated %zu entries fit to the buffer\n",
                    buffer_size/size);
            estimate = 1;
        }

        /* Dump to stdout */
        if (!buffer) {
            printf("%s", buff);
        } else {
            size_t pos;

            pos = dump_str(buffer, buffer_size, out_pos, buff, size);
            if (pos < out_pos)
                fprintf(stderr, "%ld: Buffer wrapped\n", (long)before);
            out_pos = pos;
        }

        after = time(NULL);
        if (after - before > 2) {
            fprintf(stderr, "%ld: Snapshotting took %ld!!!\n",
                    (long)before, (long)(after-before));
        }

        if (nanosleep(timeout, NULL) == -1)
            if (errno == EINTR)
                return 0;

        /* kick in the flushing */
        if (buffer)
            msync(buffer, buffer_size, MS_ASYNC);
    }
}

int collect_logs(int fd, const struct timespec *timeout, size_t buffer_size)
{
    char *buffer = NULL;

    if (fd != -1) {
        if (ftruncate(fd, buffer_size) == -1) {
            perror("ftruncate");
            return 1;
        }
        if (fallocate(fd, 0, 0, buffer_size) && errno != EOPNOTSUPP) {
            perror("fallocate");
            return 1;
        }

        /* commit it to the disk */
        sync();

        buffer = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_POPULATE, fd, 0);
        if (buffer == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    }

    return __collect_logs(timeout, buffer, buffer_size);
}

int main(int argc, char **argv)
{
    struct timespec timeout = {.tv_sec = 1};
    int fd = -1;
    size_t buffer_size = 10UL<<20;

    if (argc > 1) {
        /* output file */
        if (strcmp(argv[1], "-")) {
            fd = open_file(argv[1]);
            if (fd == -1)
                return 1;
        }
        /* timeout */
        if (argc > 2) {
            if (read_timeout(argv[2], &timeout))
                return 1;
            /* buffer size */
            if (argc > 3) {
                buffer_size = read_size(argv[3]);
                if (buffer_size == -1UL)
                    return 1;
            }
        }
    }

    printf("file:%s timeout:%lu.%lus buffer_size:%zu\n",
            (fd == -1) ? "stdout" : argv[1],
            timeout.tv_sec, timeout.tv_nsec / NS_PER_MS, buffer_size);

    return collect_logs(fd, &timeout, buffer_size);
}
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-10 6:12 UTC
To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 09-01-20 12:51:48, Michal Hocko wrote:
> > > I've updated the bug, attaching kernel messages and /proc/vmstat in
> > > 1s increments, although quite often during the build multiple seconds
> > > of sampling were just skipped as the system was under too much
> > > pressure.
> >
> > I have a tool to reduce that problem (see attached).
>
> Now for real

https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6

--
Chris Murphy
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-10 11:07 UTC
To: Chris Murphy; +Cc: linux-mm

On Thu 09-01-20 23:12:49, Chris Murphy wrote:
> On Thu, Jan 9, 2020 at 4:53 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Now for real
>
> https://bugzilla.kernel.org/show_bug.cgi?id=206117#c6

Let me paste the comment here:

: Using Michal Hocko's read_vmstat.c to record ~12 minutes while
: running 'ninja' to compile webkitgtk. I killed ninja ~20s before
: killing read_vmstat. The system is essentially unusable for 7 minutes
: prior to killing ninja, which could only be done remotely (which took
: maybe half a minute to enter and execute the kill command; ssh was very
: sluggish). During those 7 minutes, swap was not even half full, but the
: GUI was frozen most of the time, including the mouse pointer.

file:stdout timeout:1.0s buffer_size:10485760
=S 1578635397

So you have redirected the output (stdout) to a file. This is less
effective than using a file directly, because the program makes sure to
preallocate and mlock the output file data as well. Anyway, let's have a
look at what you managed to gather.

$ grep "=S" bug206117_read_vmstat.out | awk '{diff=$2-prev; if (prev && diff > 1) printf "%d %d\n", prev, diff; prev=$2}'
1578635742 2
1578635755 2
1578635776 2
1578635878 2
1578635943 2
1578635967 2
1578636027 2
1578636053 2
1578636111 3
1578636131 2
1578636138 2

So there are a few instances where the next snapshot was not gathered in
the next second. Most of the time the gap was 2s, but that doesn't
necessarily indicate a problem, because with one-second timestamp
resolution this can be just bad timing with the real difference close to
1s. There is one timestamp where the gap was 3s, which does look like a
real stall (this shouldn't happen with the mlocked output file).

Let's focus on that timestamp (it is 11min since start):

1578636111 diff
pgalloc_dma 4 0
pgalloc_movable 0 0
pgalloc_dma32 6468509 14278
pgalloc_normal 40905599 37286

That's 200M of allocations, which is not much.

pgsteal_kswapd 3384259 10763
pgscan_kswapd 5282874 21772
pgsteal_direct 3189026 30400
pgscan_direct 8372523 85974

That is 106M reclaimed during that time, and the reclaim was not
particularly effective: kswapd 49% and direct reclaim 35%.

nr_active_file 26652 -3564
nr_inactive_file 25731 -4772
nr_active_anon 1502798 6433
nr_inactive_anon 283561 4385

The file LRUs are quite small (~200M) and got shrunk somewhat, while the
anonymous LRUs eat almost 8G, and the actively referenced list is 5 times
bigger than the inactive one (5.8G). That suggests that a large part of
the memory is heavily in use.

pswpin 1222809 4882
pswpout 3475546 7814

Around 30M has been swapped out, but around 60% of that amount has been
swapped back in, which suggests that we are actively refaulting on the
swap. Please note that a large number of anonymous refaults on swap
entries might be satisfied from the swap cache. That information is
unfortunately not presented in /proc/vmstat.

On the page cache front:

workingset_activate 661011 7566
workingset_refault 1894851 14868
workingset_restore 398375 4622

50% of refaults are activated and 30% are marked as part of the
workingset.

Now let's see how those numbers evolve over time:

$ grep pswp bug206117_read_vmstat.out | awk '/pswpin/{pswpin=$2} /pswpout/{pswpout=$2; if (pswpout) {print pswpin*100/pswpout}}' | calc_min_max.awk
min: 0.06 max: 35.28 avg: 22.49 std: 11.30 nr: 592

The swapin refaulting builds up over time, which matches the expectation
that the memory demand builds up as well and doesn't fit into RAM while
the workload still needs the memory that has been swapped out.

The first swapout happens shortly before we start the direct reclaim:

$ awk '/=S/{ts=$2} /allocstall/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635582
$ awk '/=S/{ts=$2} /pswpout/{if ($2) {print ts; exit}}' bug206117_read_vmstat.out
1578635581

Page cache refaults show a slightly different dynamic:

$ grep workingset bug206117_read_vmstat.out | awk '/workingset_refault/{workingset_refault=$2} /workingset_activate/{workingset_activate=$2; if (workingset_refault) {print workingset_activate*100/workingset_refault}}'
100
91.453
91.453
91.6201
91.6201
96.587
96.7213
96.7638
96.7777
97.1401
70.2157
[...]
22.4038
22.4447
22.3699
22.3929
22.3724
22.3897
22.3873
22.3846
22.3847
[...]
34.9863
34.9863
34.9847
34.9847
34.9847
34.9846
34.9239

which means that as soon as we start refaulting, almost all of it gets
activated; then the activation rate decreases for some time, most likely
because we start swapping out more, and it settles around 30% in the end.

All that being said, the numbers confirm my initial suspicion. The
workload you are trying to run simply doesn't fit into memory, and
therefore it has to refault both from the disk for the page cache (likely
including executables) and from the swap space. It is not really obvious
whether changing the file vs. anonymous reclaim balance would help much in
this case. As we can see, swapout starts before tasks are stalled in
direct reclaim, and it is not really clear whether swapping out more would
provide relief for the page cache. So to me it looks like the reclaim is
not doing a terrible job here, considering the memory demand. It would be
interesting to see whether tuning vm.swappiness to 100 helps, but
considering how large the anonymous active list is, I would be very
skeptical.

So in the end it is really hard to see what the kernel should have done
better in this overcommitted case. Killing memory hogs would likely kill
an active workload, which would lead to a better desktop experience, but I
can imagine setups which simply want to have the work done, albeit very
slowly.

--
Michal Hocko
SUSE Labs
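[Note: the swappiness experiment suggested above can be applied at runtime
with sysctl; the change does not survive a reboot unless it is also written
to a sysctl.d drop-in, whose file name below is just a convention.]

  # apply immediately
  sudo sysctl vm.swappiness=100
  # persist across reboots
  echo 'vm.swappiness = 100' | sudo tee /etc/sysctl.d/99-swappiness.conf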
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-10 22:27 UTC
To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> So you have redirected the output (stdout) to a file. This is less
> effective than using a file directly, because the program makes sure to
> preallocate and mlock the output file data as well. Anyway, let's have a
> look at what you managed to gather.

I just read the source :P and see the usage. I'll do that properly if
there's a next time. Should it be saved in /tmp to avoid disk writes, or
does it not matter?

> It would be interesting to see whether tuning vm.swappiness to 100 helps,
> but considering how large the anonymous active list is, I would be very
> skeptical.

I can try it. Is it better to capture the same amount of time as before?
Or the entire thing until it fails or is stuck for at least 30 minutes?

> So in the end it is really hard to see what the kernel should have done
> better in this overcommitted case. Killing memory hogs would likely kill
> an active workload, which would lead to a better desktop experience, but
> I can imagine setups which simply want to have the work done, albeit
> very slowly.

Right, so the kernel can't know, and doesn't really want to know, user
intention. It's really a policy question.

But if the distribution wanted a policy of "the mouse pointer always
works" - i.e. the user should always be able to kill the offending process
from within the GUI if they want to - that implies possibly a lot of work
to carve out the necessary resources for that entire stack. I have no idea
whether that's possible with the current state of things.

Anyway, I see it's a difficult problem, and I appreciate the explanations.
I don't care about this particular example; my interest is in making this
better for everyone. I personally run into this only when I'm testing for
it, but those who experience it, experience it often. And they're often
developers. They have no idea in advance what the build resource
requirements are, and those requirements change a lot as the compile
happens. Difficult problem.

--
Chris Murphy
* Re: user space unresponsive, followup: lsf/mm congestion
From: Michal Hocko @ 2020-01-14 9:46 UTC
To: Chris Murphy; +Cc: linux-mm

On Fri 10-01-20 15:27:10, Chris Murphy wrote:
> On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > So you have redirected the output (stdout) to a file. This is less
> > effective than using a file directly, because the program makes sure
> > to preallocate and mlock the output file data as well. Anyway, let's
> > have a look at what you managed to gather.
>
> I just read the source :P and see the usage. I'll do that properly if
> there's a next time. Should it be saved in /tmp to avoid disk writes, or
> does it not matter?

The usage is described at the top of the C file. As this is an internal
tool of mine, I didn't bother to make it super easy ;)

> > It would be interesting to see whether tuning vm.swappiness to 100
> > helps, but considering how large the anonymous active list is, I would
> > be very skeptical.
>
> I can try it. Is it better to capture the same amount of time as
> before? Or the entire thing until it fails or is stuck for at least 30
> minutes?

The last data provided a good insight, so following the same methodology
should be good.

> > So in the end it is really hard to see what the kernel should have
> > done better in this overcommitted case. Killing memory hogs would
> > likely kill an active workload, which would lead to a better desktop
> > experience, but I can imagine setups which simply want to have the
> > work done, albeit very slowly.
>
> Right, so the kernel can't know, and doesn't really want to know, user
> intention. It's really a policy question.
>
> But if the distribution wanted a policy of "the mouse pointer always
> works" - i.e. the user should always be able to kill the offending
> process from within the GUI if they want to - that implies possibly a
> lot of work to carve out the necessary resources for that entire stack.
> I have no idea whether that's possible with the current state of things.

Well, you always have the choice to invoke the OOM killer via sysrq+f and
kill the memory hog that way. The more memory demanding the userspace is,
the more users have to think about how to partition memory as a resource.
We have tooling for that; it just has to be used.

> Anyway, I see it's a difficult problem, and I appreciate the
> explanations. I don't care about this particular example; my interest
> is in making this better for everyone. I personally run into this only
> when I'm testing for it, but those who experience it, experience it
> often. And they're often developers. They have no idea in advance what
> the build resource requirements are, and those requirements change a
> lot as the compile happens. Difficult problem.

I can only encourage people to report those problems and we can see where
we get from there. The underlying problem might be different even though
the symptoms seem similar.

--
Michal Hocko
SUSE Labs
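[Note: the sysrq+f trigger mentioned above can be issued from a local
console as Alt+SysRq+F, or from a root shell as below; it requires the
sysrq facility to permit the OOM-kill function.]

  # enable all sysrq functions for this boot, then force an OOM kill
  echo 1 > /proc/sys/kernel/sysrq
  echo f > /proc/sysrq-trigger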
* Re: user space unresponsive, followup: lsf/mm congestion
From: Chris Murphy @ 2020-01-12 0:07 UTC
To: Michal Hocko; +Cc: Chris Murphy, linux-mm

On Fri, Jan 10, 2020 at 4:07 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> So you have redirected the output (stdout) to a file. This is less
> effective than using a file directly, because the program makes sure to
> preallocate and mlock the output file data as well. Anyway, let's have a
> look at what you managed to gather.

[chris@fmac tmp]$ /home/chris/read_vmstat readvmstatwkgtk.out
file:readvmstatwkgtk.out timeout:1.0s buffer_size:10485760
mlockall. Continuing anyway: Cannot allocate memory
Estimated 3744 entries fit to the buffer
^C

This is with vm.swappiness=100 and is also ~12 minutes. Responsiveness
appears unchanged. The file is attached here:
https://bugzilla.kernel.org/show_bug.cgi?id=206117#c8

More curious is that I get a call trace I haven't seen before when I try
to log in to this hung system via ssh. That's in
https://bugzilla.kernel.org/show_bug.cgi?id=206117#c7

--
Chris Murphy