Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: "Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@hpe.com>
To: Gabriel Krisman Bertazi <krisman@collabora.co.uk>,
	"tytso@mit.edu" <tytso@mit.edu>,
	"david@fromorbit.com" <david@fromorbit.com>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"kernel@lists.collabora.co.uk" <kernel@lists.collabora.co.uk>,
	"alvaro.soliverez@collabora.co.uk"
	<alvaro.soliverez@collabora.co.uk>
Subject: RE: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library
Date: Fri, 12 Jan 2018 10:38:25 +0000	[thread overview]
Message-ID: <DF4PR8401MB0812A452CD9D28F218A731D685170@DF4PR8401MB0812.NAMPRD84.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <20180112071234.29470-8-krisman@collabora.co.uk>

Hi Gabriel,

A couple of comments inline below.

Olaf Weber

> -----Original Message-----
> From: Gabriel Krisman Bertazi [mailto:krisman@collabora.co.uk]
> Sent: Friday, January 12, 2018 08:12
> To: tytso@mit.edu; david@fromorbit.com; bpm@sgi.com; olaf@sgi.com
> Cc: linux-ext4@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> kernel@lists.collabora.co.uk; alvaro.soliverez@collabora.co.uk; Gabriel
> Krisman Bertazi <krisman@collabora.co.uk>
> Subject: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets
> library
> 
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
> ---
>  lib/charsets/Makefile    |   2 +-
>  lib/charsets/utf8_core.c | 178
> +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 179 insertions(+), 1 deletion(-)
>  create mode 100644 lib/charsets/utf8_core.c
> 
> diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
> index 95389c4193b0..5e2fa7c20a47 100644
> --- a/lib/charsets/Makefile
> +++ b/lib/charsets/Makefile
> @@ -4,7 +4,7 @@ obj-$(CONFIG_CHARSETS) += charsets.o
> 
>  obj-$(CONFIG_CHARSETS) += ascii.o
> 
> -utf8-y += utf8norm.o
> +utf8-y += utf8_core.o utf8norm.o
>  obj-$(CONFIG_UTF8_NORMALIZATION) +=  utf8.o
> 
>  $(obj)/utf8norm.o: $(obj)/utf8data.h
> diff --git a/lib/charsets/utf8_core.c b/lib/charsets/utf8_core.c
> new file mode 100644
> index 000000000000..94427670e96e
> --- /dev/null
> +++ b/lib/charsets/utf8_core.c
> @@ -0,0 +1,178 @@
> +/*
> + * Copyright (c) 2017 Collabora Ltd.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#include <linux/charsets.h>
> +#include <linux/utf8norm.h>
> +#include <linux/slab.h>
> +#include <linux/parser.h>
> +#include <linux/string.h>
> +
> +static int utf8_strncmp(const struct charset *charset, const char *str1,
> +			const char *str2, int len)
> +{
> +	const struct utf8data *data = utf8nfkdi(charset->version);
> +	struct utf8cursor cur1, cur2;
> +	unsigned char c1, c2;
> +	int r, i;
> +
> +	r = utf8cursor(&cur1, data, str1);
> +	if (r < 0)
> +		return -EIO;
> +	r = utf8cursor(&cur2, data, str2);
> +	if (r < 0)
> +		return -EIO;
> +
> +	for (i = 0 ; i < len ; i++) {
> +		c1 = utf8byte(&cur1);
> +		c2 = utf8byte(&cur2);
> +
> +		if (!c1 || !c2 || c1 != c2)
> +			return 1;
> +
> +	}
> +
> +	return 0;
> +}

This function is broken, but the reasons why illustrate the traps and pitfalls of working
with utf8 code and limited length buffers.

As written, if str1 or str2 doesn't trip the len check, then utf8_strncmp() returns 1 (not equal).
It does this even if the strings are equal. The check in the loop would have to be something
like this instead:
		if (c1 != c2)
			return 1;
		if (!c1) /* implies !c2 as well */
			return 0;

But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
length parameter to set up the cursors.

With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
or a an incomplete sequence of Unicode characters.

Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.

So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.

static int utf8_strncmp(const struct charset *charset,
			const char *str1,
			const char *str2,
			int len)
{
	const struct utf8data *data = utf8nfkdi(charset->version);
	struct utf8cursor cur1;
	struct utf8cursor cur2;
	int c1;
	int c2;

	if (utf8ncursor(&cur1, data, str1, len) < 0)
		return -EINVAL;
	if (utf8ncursor(&cur2, data, str2, len) < 0)
		return -EINVAL;

	do {
		c1 = utf8byte(&cur1);
		c2 = utf8byte(&cur2);

		if (c1 < 0 || c2 < 0)
			return -EINVAL;
		if (c1 != c2)
			return 1;
	} while (c1);

	return 0;
}


> +
> +static int utf8_strncasecmp(const struct charset *charset, const char *str1,
> +			    const char *str2, int len)
> +{
> +	const struct utf8data *data = utf8nfkdicf(charset->version);
> +	struct utf8cursor cur1, cur2;
> +	unsigned char c1, c2;
> +	int r, i;
> +
> +	r = utf8cursor(&cur1, data, str1);
> +	if (r < 0)
> +		return -EIO;
> +
> +	r = utf8cursor(&cur2, data, str2);
> +	if (r < 0)
> +		return -EIO;
> +
> +	for (i = 0 ; i < len ; i++) {
> +		c1 = utf8byte(&cur1);
> +		c2 = utf8byte(&cur2);
> +
> +		if (!c1 || !c2 || c1 != c2)
> +			return 1;
> +	}
> +
> +	return 0;
> +}

Same comments as above apply here.

> +
> +int utf8_casefold(const struct charset *charset, const char *str, int len,
> +		  char **folded_str)
> +{
> +	const struct utf8data *data = utf8nfkdicf(charset->version);
> +	struct utf8cursor cur;
> +	int i;
> +	char buffer[1024];
> +
> +	if (utf8cursor(&cur, data, str))
> +		return -EIO;
> +
> +	for (i = 0; i < (1024-1); i++) {
> +		buffer[i] = utf8byte(&cur);
> +		if (!buffer[i])
> +			break;
> +	}
> +	buffer[i] = '\0';
> +	*folded_str = kstrdup(buffer, GFP_NOFS);
> +	if (!*folded_str)
> +		return -ENOMEM;
> +
> +	return i;
> +}

I'm not sure a 1 buffer on the stack will be welcome. Maybe just use
utf8nlen() to get the target size and eat the cost of doing the normalization
twice. An advantage of using utf8nlen() is that it will validate the input string
as well.

Here too you should use utf8ncursor() to account for the len parameter. 

int utf8_casefold(const struct charset *charset,
		const char *str, int len,
		char **folded_str)
{
	const struct utf8data *data = utf8nfkdicf(charset->version);
	struct utf8cursor cur;
	char *s;
	int c;
	ssize_t size;

	size = utf8nlen(data, str, len);
	if (size < 0)
		return -EINVAL;
	s = kmalloc(size + 1, GFP_NOFS);
	if (!s)
		return -ENOMEM;
	*folded_string = s;
	/*
	 * utf8nlen() verified that str is well-formed, so
	 * utf8ncursor() and utf8byte() will not fail.
	 */
	utf8ncursor(&cur, data, str, len);
	do {
		c = utf8byte(&cur);
		*s++ = c;
	} while (c);

	return size;
}

The do-while loop could be written as follows as well, but IIRC that style is discouraged these days.

	while ((*s++ = utf8byte(&cur))
		;


> +
> +int utf8_normalize(const struct charset *charset, const char *str, int len,
> +		   char **normalization)
> +{
> +	const struct utf8data *data = utf8nfkdi(charset->version);
> +	struct utf8cursor cur;
> +	int i;
> +	char buffer[1024];
> +
> +	if (utf8cursor(&cur, data, str))
> +		return -EIO;
> +
> +	for (i = 0; i < (1024-1); i++) {
> +		buffer[i] = utf8byte(&cur);
> +		if (!buffer[i])
> +			break;
> +	}
> +	buffer[i] = '\0';
> +	*normalization = kstrdup(buffer, GFP_NOFS);
> +	if (!*normalization)
> +		return -ENOMEM;
> +
> +	return i;
> +}

Similar here.

> +
> +static const struct charset_ops utf8_ops = {
> +	.strncmp = utf8_strncmp,
> +	.strncasecmp = utf8_strncasecmp,
> +	.casefold = utf8_casefold,
> +	.normalize = utf8_normalize,
> +};
> +
> +static struct charset *utf8_load_charset(void *pargs)
> +{
> +	int maj, min, rev;
> +	unsigned int age;
> +	struct charset *charset;
> +	substring_t *args = pargs;
> +
> +	if (match_int(&args[0], &maj) || match_int(&args[1], &min) ||
> +	    match_int(&args[2], &rev))
> +		return NULL;
> +
> +	age = UNICODE_AGE(maj, min, rev);
> +
> +	if (!utf8version_is_supported(age))
> +		return NULL;

Maybe utf8version_is_supported() should be changed to take 'maj', 'min', 'rev' as separate parameters.

Olaf

> +
> +	charset = kmalloc(sizeof(struct charset), GFP_KERNEL);
> +	if (!charset)
> +		return NULL;
> +
> +	charset->info = NULL;
> +	charset->version = age;
> +	charset->ops = &utf8_ops;
> +
> +	return charset;
> +}
> +
> +static struct charset_info utf8_info = {
> +	.name = "utf8",
> +	.match_token = "utf8-%d.%d.%d",
> +	.load_charset = utf8_load_charset,
> +};
> +
> +static int __init init_utf8(void)
> +{
> +	charset_register(&utf8_info);
> +	return 0;
> +}
> +
> +static void __exit exit_utf8(void)
> +{
> +}
> +
> +module_init(init_utf8);
> +module_exit(exit_utf8);
> +MODULE_AUTHOR("Gabriel Krisman Bertazi");
> +MODULE_DESCRIPTION("UTF-8 charset operations for filesystems");
> +MODULE_LICENSE("GPL");
> +
> --
> 2.15.1

  reply	other threads:[~2018-01-12 10:38 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-12  7:12 [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 03/13] charsets: utf8: Add unicode character database files Gabriel Krisman Bertazi
2018-01-12 16:59   ` Darrick J. Wong
2018-01-12 20:29     ` Weber, Olaf (HPC Data Management & Storage)
2018-01-13  0:24   ` Theodore Ts'o
2018-01-13  4:28     ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 04/13] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library Gabriel Krisman Bertazi
2018-01-12 10:38   ` Weber, Olaf (HPC Data Management & Storage) [this message]
2018-01-16 16:50     ` Gabriel Krisman Bertazi
2018-01-16 22:19       ` Weber, Olaf (HPC Data Management & Storage)
2018-01-23  3:33         ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 09/13] ext4: Add ignorecase mount option Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 10/13] ext4: Include encoding information on the superblock Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 12/13] ext4: Support charset name matching Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets Gabriel Krisman Bertazi
2018-01-12 10:52   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-12 16:56 ` [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Jeremy Allison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DF4PR8401MB0812A452CD9D28F218A731D685170@DF4PR8401MB0812.NAMPRD84.PROD.OUTLOOK.COM \
    --to=olaf.weber@hpe.com \
    --cc=alvaro.soliverez@collabora.co.uk \
    --cc=david@fromorbit.com \
    --cc=kernel@lists.collabora.co.uk \
    --cc=krisman@collabora.co.uk \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --subject='RE: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).