Linux-Fsdevel Archive on lore.kernel.org
help / color / mirror / Atom feed
From: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
To: "Weber\, Olaf \(HPC Data Management & Storage\)" <olaf.weber@hpe.com>
Cc: "tytso\@mit.edu" <tytso@mit.edu>,
	"david\@fromorbit.com" <david@fromorbit.com>,
	"linux-ext4\@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"linux-fsdevel\@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"kernel\@lists.collabora.co.uk" <kernel@lists.collabora.co.uk>,
	"alvaro.soliverez\@collabora.co.uk"
	<alvaro.soliverez@collabora.co.uk>
Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library
Date: Tue, 16 Jan 2018 14:50:33 -0200	[thread overview]
Message-ID: <874lnlzr1i.fsf@collabora.co.uk> (raw)
In-Reply-To: <DF4PR8401MB0812A452CD9D28F218A731D685170@DF4PR8401MB0812.NAMPRD84.PROD.OUTLOOK.COM> (Olaf Weber's message of "Fri, 12 Jan 2018 10:38:25 +0000")

"Weber, Olaf (HPC Data Management & Storage)" <olaf.weber@hpe.com>
writes:

> But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
> the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
> length parameter to set up the cursors.
>
> With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
> null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
> or a an incomplete sequence of Unicode characters.
>
> Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.
>
> So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
> that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.
>
> static int utf8_strncmp(const struct charset *charset,
> 			const char *str1,
> 			const char *str2,
> 			int len)
> {
> 	const struct utf8data *data = utf8nfkdi(charset->version);
> 	struct utf8cursor cur1;
> 	struct utf8cursor cur2;
> 	int c1;
> 	int c2;
>
> 	if (utf8ncursor(&cur1, data, str1, len) < 0)
> 		return -EINVAL;
> 	if (utf8ncursor(&cur2, data, str2, len) < 0)
> 		return -EINVAL;
>
> 	do {
> 		c1 = utf8byte(&cur1);
> 		c2 = utf8byte(&cur2);
>
> 		if (c1 < 0 || c2 < 0)
> 			return -EINVAL;
> 		if (c1 != c2)
> 			return 1;
> 	} while (c1);
>
> 	return 0;
> }

Hi Olaf,

Thanks for your review and deep explanations.

I get your point and I've added a test case to trigger it in the
test_ucd module.

One question that I have, on the other hand: Take the version you
shared, I want to avoid the -EINVAL for the case when strings s1
and s2 should match as equal, but strlen(s1) < strlen (s2).  In this
case:

strncmp (s1, s2, strlen (s2)) => Returns 0.  Matches Ok
strncmp (s1, s2, strlen (s1)) => Returns -EINVAL

I know -EINVAL doesn't mean they don't match, but this case seems too
error prone.

I suppose we just could:

 (1) let the caller deal with it, which is error prone.  Or,

 (2) Require two lens on strncmp, one for each string, Or,

 (3) use utf8cursor for the second string, which plays bad with non-null
 terminated strings, which is important for filesystems.

Do you see an alternative? I'm pending towards option 2.  Are you ok
with that?

-- 
Gabriel Krisman Bertazi

  reply	other threads:[~2018-01-16 16:50 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-12  7:12 [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 03/13] charsets: utf8: Add unicode character database files Gabriel Krisman Bertazi
2018-01-12 16:59   ` Darrick J. Wong
2018-01-12 20:29     ` Weber, Olaf (HPC Data Management & Storage)
2018-01-13  0:24   ` Theodore Ts'o
2018-01-13  4:28     ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 04/13] scripts: add trie generator for UTF-8 Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[] Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library Gabriel Krisman Bertazi
2018-01-12 10:38   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-16 16:50     ` Gabriel Krisman Bertazi [this message]
2018-01-16 22:19       ` Weber, Olaf (HPC Data Management & Storage)
2018-01-23  3:33         ` Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 09/13] ext4: Add ignorecase mount option Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 10/13] ext4: Include encoding information on the superblock Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 12/13] ext4: Support charset name matching Gabriel Krisman Bertazi
2018-01-12  7:12 ` [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets Gabriel Krisman Bertazi
2018-01-12 10:52   ` Weber, Olaf (HPC Data Management & Storage)
2018-01-12 16:56 ` [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4 Jeremy Allison

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874lnlzr1i.fsf@collabora.co.uk \
    --to=krisman@collabora.co.uk \
    --cc=alvaro.soliverez@collabora.co.uk \
    --cc=david@fromorbit.com \
    --cc=kernel@lists.collabora.co.uk \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=olaf.weber@hpe.com \
    --cc=tytso@mit.edu \
    --subject='Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).