New Textractor release, what's new?

Textractor has been under slow development over the last few months. I'm happy to announce that a new release has been submitted to Jolla Harbour and it should be available on the Jolla Store in the near future. If you don't want to wait, you can also download it straight from GitHub.

The new version has multiple new features. As I wrote earlier, one of them is perspective correction. This means that you can select an area of the image before the OCR process starts. Textractor will try to correct the perspective distortion of the selected area, crop the image to it, preprocess the result and pass it to Tesseract OCR.
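For readers curious about what perspective correction involves, here is a minimal sketch in plain Python. It is not Textractor's actual code; it only illustrates one common approach, which is to compute a homography that maps the four corners of the selected area onto an upright rectangle. The function names (solve_linear, homography, apply_homography) and the corner coordinates are made up for this example.

```python
def solve_linear(a, b):
    """Solve a * x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner selection")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 9 coefficients (row-major 3x3, last one fixed to 1)
    of the projective transform that maps each src point to its dst point."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs) + [1.0]


def apply_homography(h, x, y):
    """Map a single point through the homography h."""
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners of the selected area (clockwise from top-left)
# and the upright 100 x 150 rectangle they should be mapped to.
selected = [(10, 20), (200, 30), (220, 300), (5, 280)]
upright = [(0, 0), (100, 0), (100, 150), (0, 150)]
h = homography(selected, upright)
```

To actually produce the corrected image, one would typically compute the transform in the other direction (rectangle to selected area) and, for every pixel of the output image, look up the corresponding source pixel.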

The OCR process can now also be followed on the application cover page. In addition, the preprocessed image is shown during the OCR process (note: some images may be shown completely black, but the image itself should be fine nevertheless). The camera orientation can now be locked to portrait or landscape mode. Automatic mode is of course still available too.


Another interesting new feature is text extraction from PDF files. I had to design the UI for the file picker from scratch because SFOS still has no ready-made component for it. I also took the latest QML FolderListModel from Qt 5.6 upstream and shipped it with the app, because I couldn't import the FolderListModel that comes with SFOS due to Harbour limitations.
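To give an idea of what the model behind such a file picker has to provide, here is a small Python sketch (not the QML code used in the app). It lists the subfolders of a directory first and then only the files with a given suffix, similar in spirit to what FolderListModel's showDirsFirst and nameFilters properties do. The function name list_entries and the choice to skip hidden entries are assumptions made for this example.

```python
import os


def list_entries(folder, suffix=".pdf"):
    """Return (subfolders, matching files) of a folder, each sorted by name."""
    dirs, files = [], []
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.name.startswith("."):
                continue  # assumption: hidden entries are not shown
            if entry.is_dir():
                dirs.append(entry.name)
            elif entry.name.lower().endswith(suffix):
                files.append(entry.name)
    return sorted(dirs, key=str.lower), sorted(files, key=str.lower)
```

A picker UI would show the folders so the user can navigate into them, and the files so one of them can be selected.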

When I was trying to get the file picker working "some easy way", I did some interesting hacking with QFileSystemModel too. It depends on Qt Widgets for some unknown reason. After some googling I found out that it needs some icon stuff from the OS, so I overrode one method with a dummy implementation and got it working... almost. Everything was fine, except that the file listing in the UI never showed up when I opened a folder for the first time. After going back up and opening the folder again, everything was fine. However, this was unacceptable behaviour and the code never saw the light of day.

After selecting a PDF file, the user can select the desired pages and proceed to OCR. This feature is mainly intended for scanned PDF files. It works well with normal PDF files too, but as the text can usually be copied directly from them, it makes little sense to run OCR on such files.