Optical Character Recognition

Using tesseract to OCR Meroitic text

circulos meos

A tesseract is, in geometry, the four-dimensional analog of the cube.

But this post is not about geometric adventures, unlike some of my previous ones.

Tesseract is also the name of an OCR engine that extracts text from printed (scanned) material.

Meroitic was a language and script used in Meroë and the Sudan during the Meroitic period (attested from 300 BCE), which went extinct around 400 CE. For purposes beyond the scope of this discussion, I needed to OCR some Meroitic text in hieroglyphic form. By the way, maybe (or maybe not) these purposes were related to some derivative work based on the Cthulhu Mythos.

So, to begin with, I had some pages written in Meroitic that I wanted to transliterate into the Latin alphabet. The Meroitic alphabet is quite small:

Meroitic alphabet (from Wikimedia)

As there is no language data for Meroitic on tesseract’s site, we’ll have to “train” tesseract to recognize it. Fortunately it’s…
