LKML Archive on lore.kernel.org
* The kernel's ctype
@ 2015-02-04 11:18 Rasmus Villemoes
  2015-02-12 11:10 ` Frans Klaver
From: Rasmus Villemoes @ 2015-02-04 11:18 UTC
  To: linux-kernel; +Cc: Andrew Morton

Hi,

The kernel's ctype is almost, but not quite, equivalent to latin1. Apart
from whether one wants to include the C1 control chars (0x80-0x9f),
there are a few other differences. For example, 0xb5 (MICRO SIGN) is, at
least according to glibc, both alpha and lower, while the kernel
classifies it as punct. A slightly surprising quirk of the kernel's
ctype implementation is that toupper() is not idempotent: Both 0xdf
(LATIN SMALL LETTER SHARP S) and 0xff (LATIN SMALL LETTER Y WITH
DIAERESIS) are correctly classified as lower, but since neither
character's uppercase version is representable in latin1, correct
toupper() behaviour would be to return the character itself. Instead, we
have toupper(0xff) == 0xdf and toupper(0xdf) == 0xbf.
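
To make the arithmetic above concrete, here is a minimal userspace sketch. It is not the kernel source: it assumes that toupper() simply subtracts 'a' - 'A' (0x20) from any byte classified as lower, and that the lower class covers ASCII a-z plus 0xdf-0xff except 0xf7 (DIVISION SIGN). The names sketch_islower() and sketch_toupper() are made up for this illustration.

```c
#include <stdbool.h>

/*
 * Userspace sketch, NOT the kernel's ctype table.
 * Assumed classification: ASCII a-z and 0xdf-0xff count as "lower",
 * except 0xf7 (DIVISION SIGN).
 */
bool sketch_islower(unsigned char c)
{
	if (c >= 'a' && c <= 'z')
		return true;
	return c >= 0xdf && c != 0xf7;
}

/*
 * Assumed conversion: shift every "lower" byte down by 'a' - 'A' (0x20),
 * without checking that the result is an uppercase letter.
 */
unsigned char sketch_toupper(unsigned char c)
{
	if (sketch_islower(c))
		c -= 'a' - 'A';
	return c;
}
```

Under these assumptions, sketch_toupper(0xff) gives 0xdf, which is itself classified as lower, so a second call gives 0xbf (INVERTED QUESTION MARK). That is the non-idempotence described above.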

Digging in pre-git history, I see that ctype.c was originally ASCII-only,
which I think is the only sane choice. It was changed around 1996, but
the commit log that I've found just says "Import 2.0.1", so it's hard to
tell what the intention was.

What would break if ctype.c was changed back to ASCII?
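
For comparison, a hypothetical ASCII-only toupper() might look like the sketch below. The function name is invented for illustration and this is not a proposed patch; it only shows that every byte outside a-z, including 0xdf and 0xff, would be returned unchanged.

```c
/*
 * Hypothetical ASCII-only variant: only a-z are treated as lower,
 * so every other byte (including all of 0x80-0xff) passes through.
 */
unsigned char ascii_only_toupper(unsigned char c)
{
	if (c >= 'a' && c <= 'z')
		c -= 'a' - 'A';
	return c;
}
```

With this variant, applying the function twice always gives the same result as applying it once.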

Rasmus


* Re: The kernel's ctype
  2015-02-04 11:18 The kernel's ctype Rasmus Villemoes
@ 2015-02-12 11:10 ` Frans Klaver
From: Frans Klaver @ 2015-02-12 11:10 UTC
  To: Rasmus Villemoes; +Cc: linux-kernel, Andrew Morton

On Wed, Feb 4, 2015 at 12:18 PM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> Hi,
>
> The kernel's ctype is almost, but not quite, equivalent to latin1. Apart
> from whether one wants to include the C1 control chars (0x80-0x9f),
> there are a few other differences. For example, 0xb5 (MICRO SIGN) is, at
> least according to glibc, both alpha and lower, while the kernel
> classifies it as punct. A slightly surprising quirk of the kernel's
> ctype implementation is that toupper() is not idempotent: Both 0xdf
> (LATIN SMALL LETTER SHARP S) and 0xff (LATIN SMALL LETTER Y WITH
> DIAERESIS) are correctly classified as lower, but since neither
> character's uppercase version is representable in latin1, correct
> toupper() behaviour would be to return the character itself. Instead, we
> have toupper(0xff) == 0xdf and toupper(0xdf) == 0xbf.
>
> Digging in pre-git history, I see that ctype.c was originally ASCII-only,
> which I think is the only sane choice. It was changed around 1996, but
> the commit log that I've found just says "Import 2.0.1", so it's hard to
> tell what the intention was.
>
> What would break if ctype.c was changed back to ASCII?

The implementation of toupper() and tolower() still seems to assume
that we're dealing with ASCII only, so in that regard I don't think
much would break: anything relying on non-ASCII input should be broken
already.

Shouldn't Coccinelle be able to detect ctype usage and use of
non-ASCII or userland input values?

Frans
