Null consistency

We have ptr::null and ptr::is_null, but CStr::from_bytes_with_nul and "nul-terminated" strings.

I think it would be more consistent to replace (deprecate and reëxport) all instances of nul with null.

What do you all think? If we have rough consensus here, I'll prepare a PR for T-libs-api FCP.

2 Likes

nul and null are two different things. null is a pointer to address 0, while nul is the name of the character with value 0.

That distinction is sometimes elided (people often say "null-terminated" rather than "nul-terminated"), but that's the distinction the API names are based on.
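
To make the distinction concrete, here is a minimal sketch using only standard-library calls. The helper names `is_null_pointer` and `ends_with_nul` are made up for illustration and are not part of std:

```rust
use std::ffi::CStr;
use std::ptr;

/// Hypothetical helper: true if `p` is the null pointer (a pointer value, not a byte).
fn is_null_pointer(p: *const u8) -> bool {
    p.is_null()
}

/// Hypothetical helper: true if `bytes` ends with the NUL character (a byte with value 0).
fn ends_with_nul(bytes: &[u8]) -> bool {
    bytes.last() == Some(&0u8)
}

fn main() {
    // "null": a pointer value.
    let p: *const u8 = ptr::null();
    assert!(is_null_pointer(p));

    // "nul": a byte value that terminates a C string.
    let bytes = b"hi\0";
    assert!(ends_with_nul(bytes));
    let c = CStr::from_bytes_with_nul(bytes).expect("exactly one trailing NUL");
    assert_eq!(c.to_bytes(), b"hi");
}
```

The two helpers take different types (`*const u8` versus `&[u8]`), so the compiler already keeps the two concepts apart; the names only mirror that.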

26 Likes

IIRC "NUL" is the abbreviation blessed by ASCII/Unicode for the null character. I'm not sure how I feel regarding the proposal, but a part of me likes that we can differentiate between NUL (the character) and null (the pointer).

8 Likes

I think a doc alias should suffice, as the two aren't quite the same as already stated.
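
For anyone unfamiliar with doc aliases, a rough sketch of what that could look like follows. The function `has_single_trailing_nul` is a hypothetical stand-in, not the real `CStr` API, and the alias strings are only my guess at what would be chosen. As far as I know, `#[doc(alias = "...")]` only affects rustdoc search results and does not change name resolution:

```rust
/// Hypothetical stand-in for a function such as `CStr::from_bytes_with_nul`.
/// Reports whether `bytes` ends with a NUL byte and contains no other NUL.
#[doc(alias = "null")]
#[doc(alias = "from_bytes_with_null")]
pub fn has_single_trailing_nul(bytes: &[u8]) -> bool {
    match bytes.split_last() {
        // Last byte is 0: the rest must be free of interior NULs.
        Some((&0, rest)) => !rest.contains(&0),
        // Empty input, or last byte is not 0.
        _ => false,
    }
}

fn main() {
    assert!(has_single_trailing_nul(b"abc\0"));
    assert!(!has_single_trailing_nul(b"a\0bc\0"));
    assert!(!has_single_trailing_nul(b"abc"));
}
```

With this, someone searching the docs for "null" would still find the "nul" item, without a second exported name.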

1 Like

The similarity of the established terms is unfortunate, but the distinction is important. from_bytes_with_null would confuse me because in my mind "null" always means a null pointer.

3 Likes

I'm really not convinced it's worth it. The "did you mean?" hint should be nigh perfect if you try ptr::nul:

error[E0425]: cannot find function `nul` in module `std::ptr`
   --> src/main.rs:2:15
    |
2   |     std::ptr::nul();
    |               ^^^ help: a function with a similar name exists: `null`

or from_bytes_with_null:

error[E0599]: no function or associated item named `from_bytes_with_null` found for struct `CStr` in the current scope
 --> src/main.rs:3:11
  |
3 |     CStr::from_bytes_with_null();
  |           ^^^^^^^^^^^^^^^^^^^^
  |           |
  |           function or associated item not found in `CStr`
  |           help: there is an associated function with a similar name: `from_bytes_with_nul`

And that seems fine to me.


Note that, according to https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt, the name of U+0000 is "NULL", not "NUL":

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;

Though admittedly https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt does list NUL as an "abbreviation":

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation

But that does mean that null is a perfectly correct way to refer to U+0000, not something that only means the null pointer. (Arguably the non-abbreviation is generally better, as elaborated in RFC 0580, rename-collections, in The Rust RFC Book, but I suppose abbreviations do make sense in method names because print_with_line_feed is a bit excessive compared to print_with_lf.)

5 Likes

Yes, but the 0-byte at the end of a CStr is just that: a byte in a series of (otherwise non-0) bytes, not a UTF-8 encoding. Thus the naming follows the ASCII convention (or perhaps even one preceding ASCII?):

 #  Abbr.  Description           #  Abbr.  Description
 0  NUL    Null                 16  DLE    Data Link Escape
 1  SOH    Start of Header      17  DC1    Device Control 1
 2  STX    Start of Text        18  DC2    Device Control 2
 3  ETX    End of Text          19  DC3    Device Control 3
 4  EOT    End of Transmission  20  DC4    Device Control 4
 5  ENQ    Enquiry              21  NAK    Negative Acknowledge
 6  ACK    Acknowledge          22  SYN    Synchronize
 7  BEL    Bell                 23  ETB    End of Transmission Block
 8  BS     Backspace            24  CAN    Cancel
 9  HT     Horizontal Tab       25  EM     End of Medium
10  LF     Line Feed            26  SUB    Substitute
11  VT     Vertical Tab         27  ESC    Escape
12  FF     Form Feed            28  FS     File Separator
13  CR     Carriage Return      29  GS     Group Separator
14  SO     Shift Out            30  RS     Record Separator
15  SI     Shift In             31  US     Unit Separator
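
A small sketch of the "bytes, not UTF-8" point, assuming nothing beyond std. The helper `c_string_is_utf8` is a hypothetical name for illustration only:

```rust
use std::ffi::CStr;

/// Hypothetical helper classifying a byte slice:
/// - `None`: not a valid NUL-terminated byte string
/// - `Some(true)`: valid C string whose content is also valid UTF-8
/// - `Some(false)`: valid C string whose content is not UTF-8
fn c_string_is_utf8(bytes: &[u8]) -> Option<bool> {
    let c = CStr::from_bytes_with_nul(bytes).ok()?;
    Some(c.to_str().is_ok())
}

fn main() {
    // Plain ASCII content followed by the NUL byte.
    assert_eq!(c_string_is_utf8(b"abc\0"), Some(true));
    // 0xFF never appears in UTF-8, yet this is a perfectly good C string.
    assert_eq!(c_string_is_utf8(b"\xff\0"), Some(false));
    // No terminating NUL byte at all.
    assert_eq!(c_string_is_utf8(b"abc"), None);
    // Interior NUL byte before the terminator.
    assert_eq!(c_string_is_utf8(b"a\0c\0"), None);
}
```

The terminator check happens at the byte level and succeeds or fails regardless of whether the content decodes as text.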
2 Likes

I have been annoyed by people speaking of ‘NULL-terminated strings’ and writing char c = NULL; enough times to appreciate, for once, a programming language clearly distinguishing the pointer to nothing from code point zero. They are named differently because they are different things, and it’s about time people learned that.

(Relatedly, about the only thing I like about Go is that it named its Unicode scalar value type ‘rune’, in order to distance the programmer, if only just slightly, from the misconception that Unicode scalars are isomorphic to ‘characters’.)
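
The same misconception is easy to demonstrate in Rust, whose `char` is also a Unicode scalar value. A minimal sketch follows; `scalar_count` is a made-up helper name:

```rust
/// Hypothetical helper: number of Unicode scalar values (Rust `char`s) in `s`.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{00E9}"; // U+00E9, e with acute as one scalar
    let decomposed = "e\u{0301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Both render as one user-perceived character, but the scalar counts differ.
    assert_eq!(scalar_count(precomposed), 1);
    assert_eq!(scalar_count(decomposed), 2);

    // The UTF-8 byte lengths differ as well.
    assert_eq!(precomposed.len(), 2);
    assert_eq!(decomposed.len(), 3);

    // Without normalization, the two strings do not compare equal.
    assert_ne!(precomposed, decomposed);
}
```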

Well, not quite. C0 and C1 control codes don’t officially have names in Unicode, only formal aliases; they had names in version 1.0, but those were withdrawn in version 1.1, and control character aliases were introduced only in version 6.1. The field you are pointing at in the UnicodeData.txt file is the ‘Unicode 1.0 name’ (property Unicode_1_Name, na1). This is what allowed the introduction of U+1F514 BELL in Unicode 6.0, despite U+0007 being known under that name in version 1.0.
