Null consistency

CAD97 · June 7, 2022, 9:23pm

We have ptr::null and ptr::is_null, but CStr::from_bytes_with_nul and "nul-terminated" strings.

I think it would be more consistent to replace (deprecate and reëxport) all instances of nul with null.

What do you all think? If we have rough consensus here, I'll prepare a PR for T-libs-api FCP.

josh · June 7, 2022, 9:30pm

nul and null are two different things. null is a pointer to address 0, while nul is the name of the character with value 0.

That distinction is sometimes elided (people often say "null-terminated" rather than "nul-terminated"), but that's the distinction the API names are based on.
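To make the distinction concrete, here is a minimal sketch that uses only `std::ptr::null`, the pointer method `is_null`, and `CStr::from_bytes_with_nul`. The helper names (`null_pointer`, `contents_before_nul`, `pointer_to_nul_is_null`) are made up for illustration and are not std APIs.

```rust
use std::ffi::CStr;
use std::ptr;

/// "null": a pointer that points nowhere (address 0).
fn null_pointer() -> *const u8 {
    ptr::null()
}

/// "nul": the byte with value 0 that terminates a C string.
/// Returns the contents without the terminator, or `None` if `bytes`
/// is not exactly one nul-terminated string.
fn contents_before_nul(bytes: &[u8]) -> Option<&[u8]> {
    CStr::from_bytes_with_nul(bytes).ok().map(|c| c.to_bytes())
}

/// A pointer *to* a nul byte is still an ordinary, non-null pointer.
fn pointer_to_nul_is_null() -> bool {
    let nul: u8 = b'\0';
    let p: *const u8 = &nul;
    p.is_null()
}
```

So `null_pointer().is_null()` holds, `contents_before_nul(b"hi\0")` yields the two bytes `hi`, and `pointer_to_nul_is_null()` is false: the two words name values of different types.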

mjbshaw · June 7, 2022, 9:30pm

IIRC "NUL" is the abbreviation blessed by ASCII/Unicode for the null character. I'm not sure how I feel regarding the proposal, but a part of me likes that we can differentiate between NUL (the character) and null (the pointer).

jhpratt · June 7, 2022, 10:25pm

I think a doc alias should suffice, as the two aren't quite the same as already stated.
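For reference, a doc alias looks roughly like the sketch below. `Terminated` is a hypothetical stand-in type, not `CStr` itself, and I'm not claiming anything about which aliases std carries today; the only real feature used is the `#[doc(alias = "...")]` attribute, which makes rustdoc's search match the alternative spelling.

```rust
/// Hypothetical wrapper type, used only to illustrate `#[doc(alias)]`.
pub struct Terminated(Vec<u8>);

impl Terminated {
    /// Accepts bytes that end in exactly one NUL (0) byte and contain no
    /// other NUL bytes.
    // With this alias, a rustdoc search for "from_bytes_with_null" also
    // finds this function, without adding a second name to the API.
    #[doc(alias = "from_bytes_with_null")]
    pub fn from_bytes_with_nul(bytes: &[u8]) -> Option<Self> {
        match bytes.split_last() {
            Some((&0, body)) if !body.contains(&0) => Some(Terminated(bytes.to_vec())),
            _ => None,
        }
    }

    /// Returns the stored bytes, including the trailing NUL.
    pub fn as_bytes_with_nul(&self) -> &[u8] {
        &self.0
    }
}
```

The alias only affects documentation search; calling `Terminated::from_bytes_with_null` would still be a compile error with the usual "similar name" hint.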

jdahlstrom · June 8, 2022, 1:17am

The similarity of the established terms is unfortunate, but the distinction is important. from_bytes_with_null would confuse me because in my mind "null" always means a null pointer.

scottmcm · June 8, 2022, 2:11am

I'm really not convinced it's worth it. The "did you mean?" hint for it should be nigh perfect if you try ptr::nul

error[E0425]: cannot find function `nul` in module `std::ptr`
   --> src/main.rs:2:15
    |
2   |     std::ptr::nul();
    |               ^^^ help: a function with a similar name exists: `null`

or from_bytes_with_null

error[E0599]: no function or associated item named `from_bytes_with_null` found for struct `CStr` in the current scope
 --> src/main.rs:3:11
  |
3 |     CStr::from_bytes_with_null();
  |           ^^^^^^^^^^^^^^^^^^^^
  |           |
  |           function or associated item not found in `CStr`
  |           help: there is an associated function with a similar name: `from_bytes_with_nul`

And that seems fine to me.

Note that, according to https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt, the name of U+0000 is "NULL", not "NUL":

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;

Though admittedly https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt does list NUL as an "abbreviation"

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation

But that does mean that null is a perfectly correct way to refer to U+0000, not something that only means the null pointer. (Arguably the non-abbreviation is generally better, as elaborated in RFC 0580 (rename-collections) in The Rust RFC Book, but I suppose abbreviations do make sense in method names because print_with_line_feed is a bit excessive compared to print_with_lf.)

lordan · June 9, 2022, 1:08pm

Yes, but the 0-byte at the end of a CStr is just that: a byte in a series of (otherwise non-0) bytes, not a UTF-8 encoding. Thus the naming follows the ASCII convention (or perhaps one even predating ASCII?):

#	Abbr.	Description	#	Abbr.	Description
0	NUL	Null	16	DLE	Data Link Escape
1	SOH	Start of Header	17	DC1	Device Control 1
2	STX	Start of Text	18	DC2	Device Control 2
3	ETX	End of Text	19	DC3	Device Control 3
4	EOT	End of Transmission	20	DC4	Device Control 4
5	ENQ	Enquiry	21	NAK	Negative Acknowledge
6	ACK	Acknowledge	22	SYN	Synchronize
7	BEL	Bell	23	ETB	End of Transmission Block
8	BS	Backspace	24	CAN	Cancel
9	HT	Horizontal Tab	25	EM	End of Medium
10	LF	Line Feed	26	SUB	Substitute
11	VT	Vertical Tab	27	ESC	Escape
12	FF	Form Feed	28	FS	File Separator
13	CR	Carriage Return	29	GS	Group Separator
14	SO	Shift Out	30	RS	Record Separator
15	SI	Shift In	31	US	Unit Separator
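A small sketch of that byte-level view, using only `CStr::from_bytes_with_nul`; the `describe` helper is a made-up name for illustration. The point is that the check is about 0 bytes, not about text encoding: the contents need not be valid UTF-8.

```rust
use std::ffi::CStr;

/// Classifies a byte slice the way `CStr::from_bytes_with_nul` sees it:
/// valid only if the sole 0 byte is the last one.
fn describe(bytes: &[u8]) -> &'static str {
    match CStr::from_bytes_with_nul(bytes) {
        Ok(_) => "valid C string",
        Err(_) => "rejected",
    }
}
```

Under this sketch, `describe(b"abc\0")` and `describe(b"\xFF\xFE\0")` are both accepted (the second is not valid UTF-8), while `describe(b"a\0c\0")` (interior NUL) and `describe(b"abc")` (no terminator) are rejected.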

felix.s · July 10, 2022, 10:08am

I have been annoyed by people speaking of ‘NULL-terminated strings’ and writing char c = NULL; enough times to appreciate, for once, a programming language clearly distinguishing the pointer to nothing from code point zero. They are named differently because they are different things, and it’s about time people learned that.

(Relatedly, about the only thing I like about Go is that it named its Unicode scalar value type ‘rune’, in order to distance the programmer, if only just slightly, from the misconception that Unicode scalars are isomorphic to ‘characters’.)

As for the name of U+0000 being "NULL" according to UnicodeData.txt: well, not quite. C0 and C1 control codes don’t officially have names in Unicode, only formal aliases; they had names in version 1.0, but those were withdrawn in version 1.1, and control character aliases were introduced only in version 6.1. The field you are pointing at in the UnicodeData.txt file is the ‘Unicode 1.0 name’ (property Unicode_1_Name, na1). This is what allowed the introduction of U+1F514 BELL in Unicode 6.0, despite U+0007 being known under that name in version 1.0.

