Isprintable() method for strings

When you copy-paste data from PDF:s, for example, you can get undesirable characters included like control characters or exotic whitespaces or some other stuff that I don't know everything about. Python has an isprintable() method to distinguishing them from printable characters that you're more likely to wanna feed into your program.

Would be great if Rust also had this method on the char type.

isprintable in python doesn't always do what you want. For example, zero-width space (U+200B) and non-breaking space (U+00A0) tab ('\t', U+0009) and newline ('\n', U+000A) are all considered as non-printable. Python's definition of "printable" only cares about whether repr() would need to escape it. Getting a universally agreed upon definition of what's printable is probably not possible, and will depend on the exact use case.

10 Likes

I don’t think it’s feasible to define "printable" in Unicode, except as !ch.is_control() perhaps. Numerous code points modify adjacent code points, and their "printability" is only meaningful in context, on the grapheme cluster level. I guess the direct analogy to Python is "what does <char as Debug>::fmt() escape", which actually… I’m not sure what it does and does not.

2 Likes

I also don't think an exact match is very useful.

>>> string = '👨‍👩‍👧‍👧'
>>> print(string.isprintable()) 
False
>>> string = '\n'
>>> print(string.isprintable()) 
False
>>> string = 'Y̴̛ͭͥ͊ͤ̄͛̓̚oͬ̄ͫ̉̃ͦ̑ͪ̀͢ư̂̾̅͋ͣͩ̂̋͢rͦͨ̾ͣ̑̿̄̍͏̷ ̔͑͛̑͛̂̓ͯ̕͜t̡ͫ̿̏̊̀̏̓̚͢e̴͐̈́͂ͤͩͫ́ͭ͝ẍ̸̢̄͊̿ͨͥ̽̌t̶̉̓̂̈́ͧ̋ͯ̚͘ ̶ͫ̅̈́̔̾͒̽̍͡h͋ͩͯͧ͛̃͒̚͟͢eͪ̂̐̾̔ͭ͑ͪ̕͠rͪ͐ͩ̋̈́ͣ̑ͭ́͠e̶ͯ̑͗̂̓̀̆̊͜'
>>> print(string.isprintable()) 
True

And Unicode is so general I don't think there's actually a good, globally applicable solution.

Most things I can think of would involve restriction to more legacy region-specific encoding. The only such thing in std is ASCII.

10 Likes

Interestingly, I found an is_printable private function inside the standard library. I wonder what its definition of printability is.

3 Likes

I think I must have accidentally leveraged this private function in the crate I just made that implements the is_printable() method the way I imagined (but I still think it should be available by default in standard library):

The way the compiler works is that some characters will get printed (i.e. parsed into an actual character), while others will instead produce the unicode value like \u{7} in the case of the bell character - at least when you're using the Debug trait.

I leveraged this discrimination of characters to implement an is_printable method

Your crate is buggy. It checks if the debug formatting of a character ends with }. But the debug formatting of all characters actually end with '.

1 Like

Also, even if you fix that, it will report '}' as not printable.

1 Like

The most foolproof method to do this is probably to use this function to get the general category of the character, and decide which categories should count as "printable".

2 Likes

Indeed, and deciding that is not trivial and it really also depends on your use case. In some cases "printable" may even in 2024 simply mean char::is_ascii_graphic.

1 Like

I wrote a program that goes through all the unicodes 1 - 0x99_999, and it identifies correctly which unicodes are printable or not. It prints with green color "{..} is printable" and what the character is, if it's printable. But if it's not printable, it will print "{..} is unprintable" and how the compiler displays the character. You can copy-paste it into a new project and run it after u import the colored crate:

printability identifier
use std::{thread::sleep, time::Duration};

use colored::*;

fn main() {
    let min_value = 0x1;
    // let min_value = 75;
    // let max_value = 120;
    let max_value = 0x99_999;

    for unicode in min_value..=max_value {
        let Some(character) = char::from_u32(unicode) else {
            println!("{unicode} is unprintable");
            continue;
        };

        let escape_debug = character.escape_debug().to_string();

        let is_special_printable = is_special_printable(&escape_debug);

        let escapes = escape_debug.starts_with('\\');

        let single_char = escape_debug.len() == 1;
        let typical_printable = !escapes || single_char;

        if typical_printable || is_special_printable {
            println!("{unicode} is {}", "printable".green());
            dbg!(character);
        } else {
            println!("{unicode} is {}", "unprintable".red());
            dbg!(character);
        }

        println!();
        println!();
        sleep(Duration::from_millis(100));
    }
}

pub fn is_special_printable(escape_debug: &str) -> bool {
    let character = escape_debug.chars().last().unwrap();

    let is_special_printable = matches!(character, '\'' | '\"' | '\\' | '/');
    is_special_printable
}

I fixed the bug. Try the 0.0.5 version.

No method is globally applicable. That's why different methods/functions/structs/etc are used for different use cases and for different situations. Rust already prints some characters but not others (in the latter case it prints the escape code instead), so the goal of is_printable() is to turn that into a method that returns a boolean

Your implementation is much more involved than it needs to be. It can just be |c: char| matches!(c, '\"' | '\'' | '\\') || c.escape_debug().count() == 1). (Note that this is counts char, not bytes.) However, escape_debug for str says

Note: only extended grapheme codepoints that begin the string will be escaped.

so checking each codepoint individually will give a different result than the string as a whole.

Separately, a simple test is at best a heuristic. If you actually want to know anything, you need to query the font rendering stack, because that's what actually decides what it means to "print" some text.

2 Likes

I applied your feedback in the version 0.0.11.

I added a comprehensive visual inspection unit test for confirming that each character is correctly processed. I also simplified the code pretty much as you suggested.

Thank you for feedback!

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.