How to tell sed “do not remove some characters”?

I have a text file containing Arabic characters and some other characters (punctuation marks, numbers, English characters, ...). How can I tell sed to remove all the characters in the file except the Arabic ones? We typically tell sed to remove or replace some specific characters and print the others, but here I am looking for the opposite: a way to tell sed to print only my desired characters and remove everything else.

Answers


With GNU sed, you should be able to specify characters by their hex or octal code. You can use those in a character class:

sed 's/[\x00-\x7F]//g' # hex notation
sed 's/[\o000-\o177]//g' # octal notation

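You should also be able to achieve much the same effect with the tr command. The square brackets are not needed, and because tr works on the whole byte stream rather than line by line, the full range \000-\177 would also delete the newlines (\012), so leave them out of it:

tr -d '\000-\011\013-\177' # all ASCII bytes except newline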

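Both methods assume that your input file is UTF-8 encoded. In UTF-8, every byte of a multi-byte character has its highest bit set, so you can simply strip every byte that is a standard (7-bit) ASCII character. Note that this keeps all non-ASCII characters, not only the Arabic ones.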
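
As a rough illustration of the byte-range approach, the sketch below runs a sed and a tr variant on a made-up sample line. It assumes GNU sed and UTF-8 input; the sample text and the variable names are only for demonstration, and LC_ALL=C is an added assumption so that the tools compare raw bytes regardless of the current locale.

```shell
# Hypothetical sample line: Arabic words mixed with ASCII letters, digits, and punctuation.
sample='مرحبا hello, 123! عالم'

# GNU sed: delete every 7-bit ASCII byte. sed works line by line, so the
# line's trailing newline is not part of the pattern space and survives.
with_sed=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[\x00-\x7F]//g')

# tr: same idea, but the range skips \012 so that newlines are kept.
with_tr=$(printf '%s\n' "$sample" | LC_ALL=C tr -d '\000-\011\013-\177')

printf '%s\n' "$with_sed" "$with_tr"
```

Both variables should end up holding only the Arabic letters; the spaces are removed as well, because a space is an ASCII character.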


To remove everything except a well-defined set of characters, use a negated character class:

sed 's/[^characters you want to keep]//g'

Using a pattern such as [^…]\+ might improve the performance of the regex, since each run of unwanted characters is then removed in a single substitution.
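
A minimal sketch of the negated-class idea, assuming GNU sed and UTF-8 input. The keep-list used here (all non-ASCII bytes plus the space character) and the sample text are illustrative assumptions, not the only sensible choice; like the byte-range approach, it keeps every non-ASCII character, not just Arabic ones.

```shell
# Hypothetical sample line: Arabic words mixed with ASCII letters, digits, and punctuation.
sample='مرحبا hello, 123! عالم'

# Keep-list: every non-ASCII byte (\x80-\xFF) plus the space character.
# [^...] matches anything NOT in the list; \+ (a GNU extension) removes a
# whole run of unwanted bytes per substitution. tr -s then squeezes the
# leftover runs of spaces into single spaces.
kept=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^\x80-\xFF ]\+//g' | tr -s ' ')

printf '%s\n' "$kept"
```

Keeping the space in the list preserves the word boundaries between the Arabic words, which the plain ASCII-stripping commands do not.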

