How to tell sed “do not remove some characters”?
I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, ...). How can I tell sed to remove all the characters in the file except the Arabic ones? In short, we typically tell sed to remove or replace some specific characters and print the rest, but now I am looking for a way to tell sed to print only my desired characters and remove everything else.
With GNU sed, you should be able to specify characters by their hex or octal code. You can use those in a character class:
sed 's/[\x00-\x7F]//g'    # hex notation
sed 's/[\o000-\o177]//g'  # octal notation
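You should also be able to achieve much the same effect with the tr command. tr takes the range without brackets, and unlike sed it also deletes the newlines, since they fall inside the range:
tr -d '\000-\177'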
Both methods assume your input file is UTF-8 encoded. In UTF-8, every byte of a multi-byte character has its highest bit set, so you can simply strip everything that's a standard ASCII (7-bit) character. Note that this keeps all non-ASCII characters, not only the Arabic ones.
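As a minimal sketch of the byte-stripping idea, assuming UTF-8 input: the helper name strip_ascii is made up for illustration, and the range is split around octal 012 so that newlines survive.

```shell
# strip_ascii (hypothetical helper): delete every ASCII byte except
# newline (octal 012). Bytes with the high bit set, which is all that
# multi-byte UTF-8 sequences consist of, pass through untouched.
strip_ascii() {
    LC_ALL=C tr -d '\000-\011\013-\177'
}

# Example: Latin letters, digits, punctuation, and spaces are removed.
printf 'abc, سلام 123!\n' | strip_ascii
```

Spaces are ASCII too, so words run together in the output; leave octal 040 out of the deleted ranges if you want to keep them.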
To remove everything except a well-defined set of characters, use a negated character class:
sed 's/[^characters you want to keep]//g'
A pattern like [^…]\+, which matches a whole run of unwanted characters in one go, might improve the performance of the regex.
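To illustrate the negated class, here is a small sketch that uses an ASCII keep-set (digits) so that it behaves the same in any locale. The helper name keep_only is hypothetical. For Arabic you would put the Arabic letters (or a range of them) in the class and run sed in a UTF-8 locale, which this sketch does not test.

```shell
# keep_only SET (hypothetical helper): delete every character that is
# not in the bracket-expression SET. [^SET][^SET]* is the portable
# spelling of GNU sed's [^SET]\+ and matches a whole run at once.
keep_only() {
    sed "s/[^$1][^$1]*//g"
}

# Example: keep only the digits of each line.
printf 'tel: +49 (0)30 1234\n' | keep_only '0-9'
```

Because SET is pasted into a bracket expression, characters such as ] or a leading ^ need the usual bracket-expression care.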