[Discuss] Unicode grep
Alan W. Irwin
irwin at beluga.phys.uvic.ca
Tue Apr 29 19:46:28 PDT 2008
On 2008-04-29 18:21-0700 Michael Foltinek wrote:
> Hi, all,
> I run Slackware and OpenBSD, and neither of their grep manpages
> mentions Unicode. However, when I googled it, there's an online Linux
> manpage that talks about parsing Unicode (though it's more complicated
> than that, it seems).
>
> So, my question to you all is: who has a grep that understands
> Unicode, and what distro is it in?
That's an interesting question. When I googled for it myself, I found a
2003 Debian discussion that seemed to imply the ordinary Linux grep was
already UTF-8 aware. Indeed, I have access to some source code containing
UTF-8 strings, so I tried grepping for one of those strings (the Kurdish
word for peace, Hasîtî), and that string was found.
Note that UTF-8 is one particular Unicode encoding, and it is so convenient
(for example, ASCII text is unchanged when encoded as UTF-8) that it is
displacing all the other possible Unicode encodings as well as the
deprecated ISO-style character sets.
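To make the ASCII point concrete, here is a small Python sketch (not part of the original test; the variable names are only illustrative). It checks that plain ASCII text produces identical bytes under ASCII and UTF-8, and compares the byte lengths of the search word under UTF-8 and ISO-8859-1, assuming the "î" is the precomposed character U+00EE.

```python
# ASCII text encodes to exactly the same bytes in UTF-8 as in ASCII.
ascii_text = "grep"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# The Kurdish word from the message, with "î" written as U+00EE
# (an assumption: the precomposed form, not "i" plus a combining accent).
word = "Has\u00eet\u00ee"

utf8_bytes = word.encode("utf-8")         # each "î" becomes two bytes
latin1_bytes = word.encode("iso-8859-1")  # each "î" is a single byte

print(same_bytes)
print(len(word), len(utf8_bytes), len(latin1_bytes))
```

The ASCII letters of the word keep their one-byte values in both encodings; only the two accented characters differ, which is why a file has to be read with the encoding it was written in.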
So assuming a particular Unicode document is encoded as UTF-8 (which is
the default on Linux), I think you can just go ahead and use ordinary grep
to search that document, and you should be fine.
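For anyone who wants to double-check a result without relying on grep itself, the same kind of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how grep is implemented: `grep_utf8` is a hypothetical helper, the sample file contents are invented for the demonstration, and the file is assumed to be UTF-8 encoded.

```python
import os
import tempfile


def grep_utf8(pattern, path):
    """Return the lines of a UTF-8 text file that contain pattern.

    Hypothetical helper for illustration; it does a plain substring
    match, not a regular-expression match.
    """
    matches = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if pattern in line:
                matches.append(line.rstrip("\n"))
    return matches


# Build a small UTF-8 sample file so the sketch is self-contained.
sample = 'peace = "Has\u00eet\u00ee"\nhello = "world"\n'
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

found = grep_utf8("Has\u00eet\u00ee", sample_path)
os.remove(sample_path)  # clean up the temporary file
print(found)
```

If the document were in some other encoding, the `encoding` argument to `open` would have to name that encoding instead.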
For more on Unicode and Unicode-enabled fonts, see the links collected
at http://unifont.org/.
Alan
__________________________
Alan W. Irwin
Astronomical research affiliation with Department of Physics and Astronomy,
University of Victoria (astrowww.phys.uvic.ca).
Programming affiliations with the FreeEOS equation-of-state implementation
for stellar interiors (freeeos.sf.net); PLplot scientific plotting software
package (plplot.org); the libLASi project (unifont.org/lasi); the Loads of
Linux Links project (loll.sf.net); and the Linux Brochure Project
(lbproject.sf.net).
__________________________
Linux-powered Science
__________________________