When you copy-paste data from PDFs, for example, you can end up with undesirable characters such as control characters, exotic whitespace, or other things I don't know much about. Python has an isprintable() method to distinguish them from the printable characters that you're more likely to want to feed into your program.
Would be great if Rust also had this method on the char type.
isprintable in Python doesn't always do what you want. For example, zero-width space (U+200B), non-breaking space (U+00A0), tab ('\t', U+0009), and newline ('\n', U+000A) are all considered non-printable. Python's definition of "printable" only cares about whether repr() would need to escape the character. A universally agreed-upon definition of what's printable is probably not possible; it will depend on the exact use case.
I don’t think it’s feasible to define "printable" in Unicode, except as !ch.is_control() perhaps. Numerous code points modify adjacent code points, and their "printability" is only meaningful in context, on the grapheme cluster level. I guess the direct analogy to Python is "what does <char as Debug>::fmt() escape", though I’m actually not sure what it does and does not escape.
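To make the two candidate definitions above concrete, here is a minimal sketch that compares `!ch.is_control()` with what `Debug` produces for a `char`. The helper names `debug_repr` and `is_not_control` are made up for illustration.

```rust
/// Hypothetical helper: the text that `{:?}` produces for one `char`,
/// including the surrounding single quotes.
fn debug_repr(c: char) -> String {
    format!("{:?}", c)
}

/// The coarse definition floated above: anything that is not a
/// control character (Unicode general category Cc).
fn is_not_control(c: char) -> bool {
    !c.is_control()
}
```

Under this sketch, `debug_repr('\u{7}')` yields the escaped form `'\u{7}'`, while zero-width space (U+200B) still passes `is_not_control` even though it draws nothing, which shows how loose that definition is.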
I think I must have accidentally leveraged this private function in the crate I just made, which implements the is_printable() method the way I imagined it (but I still think it should be available by default in the standard library):
The way Rust's formatting works is that some characters get printed as-is, while others are shown as a Unicode escape instead, like \u{7} in the case of the bell character, at least when you're using the Debug trait.
I leveraged this distinction between characters to implement an is_printable method.
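The distinction described here can be observed directly through `char::escape_debug`, which yields the escaped form without the surrounding quotes. This is only a small sketch; the helper name `escaped` is hypothetical.

```rust
/// Hypothetical helper: the escaped form of a single `char`,
/// as produced by the standard library's `char::escape_debug`.
fn escaped(c: char) -> String {
    c.escape_debug().to_string()
}
```

An ordinary letter comes back unchanged, the bell character comes back as the six-character text `\u{7}`, and quotes and backslashes come back with a leading backslash even though they are perfectly visible, which is why they need special handling in any predicate built on top of this.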
The most foolproof method to do this is probably to use this function to get the general category of the character, and decide which categories should count as "printable".
Indeed, and deciding that is not trivial and it really also depends on your use case. In some cases "printable" may even in 2024 simply mean char::is_ascii_graphic.
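For the narrow ASCII-only reading of "printable", a filter along these lines may be all that is needed. It is a sketch with a made-up function name, and note that `is_ascii_graphic` covers U+0021 through U+007E, so the plain space is dropped too.

```rust
/// Hypothetical helper: keep only ASCII graphic characters
/// (U+0021 '!' through U+007E '~'); spaces, tabs, control
/// characters and all non-ASCII characters are removed.
fn keep_ascii_graphic(input: &str) -> String {
    input.chars().filter(char::is_ascii_graphic).collect()
}
```

If spaces should survive, the predicate could be widened to `|c| c.is_ascii_graphic() || *c == ' '`.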
I wrote a program that goes through all the code points from 1 to 0x99_999 and correctly identifies which of them are printable. If a character is printable, it prints "{..} is printable" in green along with the character itself. If it's not printable, it prints "{..} is unprintable" along with how Rust displays the character. You can copy-paste it into a new project and run it after you add the colored crate as a dependency:
printability identifier
use std::{thread::sleep, time::Duration};

use colored::*;

fn main() {
    let min_value = 0x1;
    // let min_value = 75;
    // let max_value = 120;
    let max_value = 0x99_999;

    for unicode in min_value..=max_value {
        // Surrogate code points are not valid `char`s, so `from_u32` returns `None`.
        let Some(character) = char::from_u32(unicode) else {
            println!("{unicode} is unprintable");
            continue;
        };

        let escape_debug = character.escape_debug().to_string();
        let is_special_printable = is_special_printable(&escape_debug);
        let escapes = escape_debug.starts_with('\\');
        // `len()` counts bytes, so this is only true for unescaped ASCII.
        let single_char = escape_debug.len() == 1;
        let typical_printable = !escapes || single_char;

        if typical_printable || is_special_printable {
            println!("{unicode} is {}", "printable".green());
            dbg!(character);
        } else {
            println!("{unicode} is {}", "unprintable".red());
            dbg!(character);
        }

        println!();
        println!();
        sleep(Duration::from_millis(100));
    }
}

/// Quotes and backslashes are escaped by `escape_debug` but are still visible characters.
pub fn is_special_printable(escape_debug: &str) -> bool {
    let character = escape_debug.chars().last().unwrap();
    matches!(character, '\'' | '\"' | '\\' | '/')
}
No method is globally applicable. That's why different methods/functions/structs/etc are used for different use cases and for different situations. Rust already prints some characters but not others (in the latter case it prints the escape code instead), so the goal of is_printable() is to turn that into a method that returns a boolean.
Your implementation is much more involved than it needs to be. It can just be |c: char| matches!(c, '\"' | '\'' | '\\') || c.escape_debug().count() == 1. (Note that this counts chars, not bytes.) However, the documentation of escape_debug for str says:
Note: only extended grapheme codepoints that begin the string will be escaped.
so checking each codepoint individually will give a different result than the string as a whole.
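The difference between the two approaches can be seen with a combining mark such as U+0301 (combining acute accent), which is a grapheme-extending code point. The sketch below uses two hypothetical helpers to escape the same text both ways.

```rust
/// Hypothetical helper: escape every code point on its own,
/// the way a per-`char` check would see it.
fn escape_each_char(s: &str) -> String {
    s.chars().flat_map(char::escape_debug).collect()
}

/// Hypothetical helper: escape the string as a whole with
/// `str::escape_debug`, which only escapes a grapheme-extending
/// code point when it is the first character of the string.
fn escape_whole_str(s: &str) -> String {
    s.escape_debug().to_string()
}
```

For the input `"e\u{301}"`, the whole-string version leaves the accent in place because it follows a base character, while the per-character version turns it into the text `\u{301}`. So a per-character predicate would call that accent unprintable even though the string as a whole prints fine.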
Separately, a simple test is at best a heuristic. If you actually want to know anything, you need to query the font rendering stack, because that's what actually decides what it means to "print" some text.
I added a comprehensive visual inspection unit test for confirming that each character is correctly processed. I also simplified the code pretty much as you suggested.
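For readers following along, the simplified predicate suggested earlier in the thread might look roughly like this as a free function. This is a sketch based on that suggestion, not the actual code of the crate, and the function name is assumed.

```rust
/// Sketch of the suggested heuristic: a character counts as
/// "printable" if `escape_debug` leaves it as a single `char`,
/// or if it is a quote or backslash (escaped, but still visible).
fn is_printable(c: char) -> bool {
    matches!(c, '"' | '\'' | '\\') || c.escape_debug().count() == 1
}
```

Because `count()` counts `char`s rather than bytes, multi-byte characters such as 'é' are handled correctly. Control characters like the bell or newline, and a lone combining mark such as U+0301, are reported as not printable, in line with the caveats raised above.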