Linux-Fsdevel Archive on lore.kernel.org
* [RFC] page cache drop-behind
@ 2020-06-10  2:39 Matthew Wilcox
  2020-06-11 20:30 ` Andres Freund
  0 siblings, 1 reply; 2+ messages in thread
From: Matthew Wilcox @ 2020-06-10  2:39 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: Andres Freund


Andres reported a problem recently where reading a file several times
the size of memory causes intermittent stalls.  My suspicion is that
page allocation eventually runs into the low watermark and starts to
do reclaim.  Some shrinkers take a long time to run and have a low chance
of actually freeing a page (eg the dentry cache needs to free 21 dentries
which all happen to be on the same pair of pages to free those two pages).

This patch attempts to free pages from the file that we're currently
reading from if there are no pages readily available.  If that doesn't
work, we'll run all the shrinkers just as we did before.

This should solve Andres' problem, although it's a bit narrow in scope.
It might be better to look through the inactive page list, regardless of
which file they were allocated for.  That could solve the "weekly backup"
problem with lots of little files.

I'm not really set up to do performance testing at the moment, so this
is just me thinking hard about the problem.
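For intuition, the same drop-behind behaviour can be approximated from userspace with posix_fadvise().  This is only an illustrative sketch, not the patch's mechanism: DROP_WINDOW is an assumed value (the patch keeps 2 * ra_pages), and read_with_dropbehind() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed window size; the patch uses 2 * f_ra.ra_pages instead. */
#define DROP_WINDOW (8UL * 1024 * 1024)

/*
 * Read a file sequentially, dropping the page cache more than
 * DROP_WINDOW bytes behind the current position.  Returns the
 * number of bytes read, or -1 on error.
 */
static long read_with_dropbehind(const char *path, char *buf, size_t bufsize)
{
	off_t done = 0, dropped = 0;
	ssize_t rc;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	while ((rc = read(fd, buf, bufsize)) > 0) {
		done += rc;
		/* Drop whole windows that are safely behind us. */
		while (done - dropped > (off_t)(2 * DROP_WINDOW)) {
			posix_fadvise(fd, dropped, DROP_WINDOW,
				      POSIX_FADV_DONTNEED);
			dropped += DROP_WINDOW;
		}
	}
	close(fd);
	return rc < 0 ? -1 : (long)done;
}
```

The kernel patch differs in an important way: it only invalidates when a light (no-direct-reclaim) allocation has already failed, rather than unconditionally on every read.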

diff --git a/mm/readahead.c b/mm/readahead.c
index 3c9a8dd7c56c..3531e1808e24 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -111,9 +111,24 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 	}
 	return ret;
 }
-
 EXPORT_SYMBOL(read_cache_pages);
 
+/*
+ * Attempt to detect a streaming workload which exceeds memory and
+ * handle it by dropping the page cache behind the active part of the
+ * file.
+ */
+static void discard_behind(struct file *file, struct address_space *mapping)
+{
+	unsigned long keep = file->f_ra.ra_pages * 2;
+
+	if (mapping->nrpages < 1000)
+		return;
+	if (file->f_ra.start < keep)
+		return;
+	invalidate_mapping_pages(mapping, 0, file->f_ra.start - keep);
+}
+
 static void read_pages(struct readahead_control *rac, struct list_head *pages,
 		bool skip_page)
 {
@@ -179,6 +194,7 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 {
 	LIST_HEAD(page_pool);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	gfp_t light_gfp = gfp_mask & ~__GFP_DIRECT_RECLAIM;
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = file,
@@ -219,7 +235,11 @@ void page_cache_readahead_unbounded(struct address_space *mapping,
 			continue;
 		}
 
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(light_gfp);
+		if (!page) {
+			discard_behind(file, mapping);
+			page = __page_cache_alloc(gfp_mask);
+		}
 		if (!page)
 			break;
 		if (mapping->a_ops->readpages) {


* Re: [RFC] page cache drop-behind
  2020-06-10  2:39 [RFC] page cache drop-behind Matthew Wilcox
@ 2020-06-11 20:30 ` Andres Freund
  0 siblings, 0 replies; 2+ messages in thread
From: Andres Freund @ 2020-06-11 20:30 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2737 bytes --]

Hi Matthew,

That's great!


On 2020-06-09 19:39:39 -0700, Matthew Wilcox wrote:
> Andres reported a problem recently where reading a file several times
> the size of memory causes intermittent stalls.  My suspicion is that
> page allocation eventually runs into the low watermark and starts to
> do reclaim.  Some shrinkers take a long time to run and have a low chance
> of actually freeing a page (eg the dentry cache needs to free 21 dentries
> which all happen to be on the same pair of pages to free those two pages).

That meshes with some of what I saw in profiles, IIRC.

There's a related issue that I don't think this would solve, but which
could be solved by some form of write-behind, in particular that I've
observed that reaching dirty data limits often leads to a pretty random
selection of pages being written back.
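A minimal sketch of that write-behind pattern (the same idea as the sync_file_range path in the attached test program, stripped down; WB_CHUNK is an assumed value and write_behind() a hypothetical helper):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed chunk size for illustration. */
#define WB_CHUNK (1UL * 1024 * 1024)

/*
 * Write `total` bytes sequentially; every WB_CHUNK bytes, start
 * asynchronous writeback of the chunk just completed, so dirty pages
 * are flushed in file order rather than in whatever order hitting the
 * dirty limits happens to pick.  Returns bytes written, or -1 on error.
 */
static long write_behind(int fd, const char *buf, size_t bufsize, size_t total)
{
	size_t written = 0, flushed = 0;

	while (written < total) {
		size_t n = total - written < bufsize ? total - written : bufsize;
		ssize_t rc = write(fd, buf, n);

		if (rc < 0)
			return -1;
		written += rc;

		/* Start, but do not wait for, writeback of completed chunks. */
		while (written - flushed >= WB_CHUNK) {
			if (sync_file_range(fd, flushed, WB_CHUNK,
					    SYNC_FILE_RANGE_WRITE) != 0)
				return -1;
			flushed += WB_CHUNK;
		}
	}
	return (long)written;
}
```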


> This patch attempts to free pages from the file that we're currently
> reading from if there are no pages readily available.  If that doesn't
> work, we'll run all the shrinkers just as we did before.

> This should solve Andres' problem, although it's a bit narrow in scope.
> It might be better to look through the inactive page list, regardless of
> which file they were allocated for.  That could solve the "weekly backup"
> problem with lots of little files.

I wonder if there are cases where that'd cause problems:
1) If the pages selected are still dirty, doesn't this have a
   significant potential for additional stalls?
2) For some database/postgres tasks it's pretty common to occasionally
   do in-place writes where parts of the data are already in the page
   cache, but others aren't. A bit worried that this'd be more
   aggressive about throwing away pages that were already cached before
   the writes.


> I'm not really set up to do performance testing at the moment, so this
> is just me thinking hard about the problem.

The workload where I was observing the issue was creating backups of
larger postgres databases. I've attached the test program we've used.

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
  rm -f /srv/dev/bench/test* && \
  echo 3 |sudo tee /proc/sys/vm/drop_caches && \
  perf stat -a -e cpu-clock,ref-cycles,cycles,instructions \
     /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=0 --fallocate=0 --fadvise=0 --sequential=0 --filesize=$((100*1024*1024*1024)) /srv/dev/bench/test{1,2,3,4}

Was the last invocation I found in my shell history :)

I found that sync_file_range=1, fadvise=1 often gave considerably better
performance. Here's a pointer to the thread with more details (see also
my response downthread):
https://postgr.es/m/20200503023643.x2o2244sa4vkikyb%40alap3.anarazel.de

Greetings,

Andres Freund

[-- Attachment #2: write_and_fsync.c --]
[-- Type: text/x-csrc, Size: 5570 bytes --]

#define _GNU_SOURCE

#include <fcntl.h>
#include <getopt.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>
#include <stdbool.h>

#define SFR_START_WRITE_DELAY (1 * 1024 * 1024)
#define SFR_WAIT_WRITE_DELAY (8 * 1024 * 1024)
#define SFR_START_SIZE (512 * 1024)
#define SFR_WAIT_SIZE (8 * 1024 * 1024)

#define FADVISE_DONTNEED_SIZE (512 * 1024)
#define FADVISE_DONTNEED_DELAY (SFR_WAIT_WRITE_DELAY + FADVISE_DONTNEED_SIZE)

typedef struct runparams
{
	uint32_t blocksize;
	uint64_t filesize;
	int numprocs;
	int numfiles;
	char **filenames;
	bool fallocate;
	bool fadvise;
	bool sync_file_range;
	bool sequential;
} runparams;

extern void runtest(const runparams *params, char *filename);

const struct option getopt_options[] = {
	{.name = "filesize", .has_arg = required_argument,  .val = 's'},
	{.name = "blocksize", .has_arg = required_argument, .val = 'b'},
	{.name = "fallocate", .has_arg = required_argument, .val = 'a'},
	{.name = "fadvise", .has_arg = required_argument, .val = 'f'},
	{.name = "sync_file_range",  .has_arg = required_argument, .val = 'r'},
	{.name = "sequential",  .has_arg = required_argument, .val = 'q'},
	{}};

static void
helpdie(void)
{
	fprintf(stderr, "\n"
			"Usage: write_and_fsync [OPTIONS] [FILES]\n"
			"--filesize=...\n"
			"--blocksize=...\n"
			"--fallocate=yes/no/0/1\n"
			"--fadvise=yes/no/0/1\n"
			"--sync_file_range=yes/no/0/1\n"
			"--sequential=yes/no/0/1\n");
	exit(1);
}

int
main(int argc, char **argv)
{
	runparams params = {
		.blocksize = 8192,
	};
	int	status;

	while (1)
	{
		int o;

		o = getopt_long(argc, argv, "", getopt_options, NULL);

		if (o == -1)
			break;

		switch (o)
		{
			case 0:
				break;
			case 's':
				params.filesize = strtoull(optarg, NULL, 0);
				break;
			case 'b':
				params.blocksize = strtoul(optarg, NULL, 0);
				break;
			case 'a':
				params.fallocate = strcmp(optarg, "yes") == 0 || strcmp(optarg, "1") == 0;
				break;
			case 'f':
				params.fadvise = strcmp(optarg, "yes") == 0 || strcmp(optarg, "1") == 0;
				break;
			case 'r':
				params.sync_file_range = strcmp(optarg, "yes") == 0 || strcmp(optarg, "1") == 0;
				break;
			case 'q':
				params.sequential = strcmp(optarg, "yes") == 0 || strcmp(optarg, "1") == 0;
				break;
			case '?':
				helpdie();
				break;
			default:
				fprintf(stderr, "huh: %d\n", o);
				helpdie();
		}
	}

	params.filenames = &argv[optind];
	params.numprocs = argc - optind;

	if (params.numprocs <= 0 || params.filesize == 0)
		helpdie();

	printf("running test with: numprocs=%d filesize=%llu blocksize=%d fallocate=%d sfr=%d fadvise=%d sequential=%d\n",
		   params.numprocs,
		   (unsigned long long) params.filesize, params.blocksize,
		   params.fallocate, params.sync_file_range, params.fadvise, params.sequential);
	fflush(stdout);

	for (int fileno = 0; fileno < params.numprocs; fileno++)
	{
		pid_t	pid = fork();

		if (pid == 0)
		{
			runtest(&params, params.filenames[fileno]);
			exit(0);
		}
		else if (pid < 0)
		{
			perror("fork");
			exit(1);
		}
	}

	while (wait(&status) >= 0)
		;
	sleep(1);

	return 0;
}

void
runtest(const runparams* params, char *filename)
{
	const int bs = params->blocksize;
	const uint64_t filesize = params->filesize;
	const bool sfr = params->sync_file_range;
	const bool fadv = params->fadvise;
	char *junk;

	junk = malloc(bs);
	if (!junk)
		exit(1);

	memset(junk, 'J', params->blocksize);

	time_t	t0 = time(NULL);
	int fd = open(filename, O_CREAT | O_TRUNC | O_WRONLY, 0600);
	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	time_t	t1 = time(NULL);

	if (params->fallocate)
	{
		if (posix_fallocate(fd, 0, filesize) != 0)
		{
			perror("posix_fallocate");
			exit(1);
		}
	}


	if (params->sequential)
	{
		if (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL) != 0)
		{
			perror("posix_fadvise");
			exit(1);
		}
	}

	time_t	t2 = time(NULL);
	uint64_t bytes_written = 0;
	uint64_t last_wait_write = 0;
	uint64_t last_start_write = 0;
	uint64_t last_dontneed = 0;

	while (bytes_written + bs < filesize)
	{
		int wc = write(fd, junk, bs);

		if (wc != bs)
		{
			fprintf(stderr, "wc = %d\n", wc);
			perror("write");
			exit(1);
		}
		bytes_written += bs;


		/* wait for last write */
		if (sfr)
		{
			if (last_wait_write + SFR_WAIT_WRITE_DELAY + SFR_WAIT_SIZE < bytes_written)
			{
				if (sync_file_range(fd, last_wait_write, SFR_WAIT_SIZE, SYNC_FILE_RANGE_WAIT_BEFORE) != 0)
				{
					perror("sfr(wait_before)");
					exit(1);
				}
				last_wait_write += SFR_WAIT_SIZE;
			}

			if (last_start_write + SFR_START_WRITE_DELAY + SFR_START_SIZE < bytes_written)
			{

				if (sync_file_range(fd, last_start_write, SFR_START_SIZE, SYNC_FILE_RANGE_WRITE) != 0)
				{
					perror("sfr(write)");
					exit(1);
				}
				last_start_write += SFR_START_SIZE;
			}
		}

		if (fadv)
		{
			if (last_dontneed + FADVISE_DONTNEED_DELAY + FADVISE_DONTNEED_SIZE < bytes_written)
			{
				if (posix_fadvise(fd, last_dontneed, FADVISE_DONTNEED_SIZE, POSIX_FADV_DONTNEED) != 0)
				{
					perror("fadvise(dontneed)");
					exit(1);
				}
				last_dontneed += FADVISE_DONTNEED_SIZE;
			}
		}
	}

	time_t	t3 = time(NULL);
	if (fsync(fd) != 0)
	{
		perror("fsync");
		exit(1);
	}

	time_t	t4 = time(NULL);
	if (close(fd) != 0)
	{
		perror("close");
		exit(1);
	}

	time_t	t5 = time(NULL);
	printf("[%s][%d] open: %ld, fallocate: %ld, write: %ld, fsync: %ld, close: %ld, total: %ld\n",
	       filename, getpid(), (long) (t1 - t0), (long) (t2 - t1), (long) (t3 - t2),
	       (long) (t4 - t3), (long) (t5 - t4), (long) (t5 - t0));
}

