Using an ML model for OCR

I've been working on a number of simple problems I want to solve using AI models. One of those requires OCR (Optical Character Recognition), which is the process of converting different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data.

Finding a model

I found a pre-trained model on the Hugging Face model hub called ucaslcl/GOT-OCR2_0. For those interested, the details of the model are available on its model page: https://huggingface.co/ucaslcl/GOT-OCR2_0. The goal is a proof of concept to see how well it works, with the eventual aim of packaging it into a deployable container.

Note: [Hugging Face](https://huggingface.co/) is an incredible resource for finding pre-trained models for a variety of tasks. They have a large developer community and offer hosting solutions.
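
If you'd rather browse the Hub programmatically than through the website, the huggingface_hub package can search it. This is only a sketch, and it assumes you've run pip install huggingface_hub (it isn't part of the install steps below); the search term is an arbitrary choice:

from huggingface_hub import HfApi

# print the ids of a few OCR-related models found on the Hub
api = HfApi()
for m in api.list_models(search="OCR", limit=5):
    print(m.id)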

NVIDIA CUDA toolkit

This model requires a GPU. Specifically, I'm using the NVIDIA CUDA toolkit to get this set up. You can find the latest version of the CUDA toolkit at https://developer.nvidia.com/cuda-downloads.

Once that's installed, you can start setting up your Python environment.

Setting up environment

Two links you'll need for this part:

https://developer.nvidia.com/cuda-downloads to download the latest version of the CUDA toolkit

https://pytorch.org/get-started/locally/ to find the PyTorch install command that matches your CUDA version

Setting up environment

Creating a Python virtual environment

mkdir ocr
cd ocr
py -m venv ./venv
venv\Scripts\activate

You know it's activated when you see (venv) in the terminal.
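
If you want a second check beyond the prompt, you can ask Python itself where it's running from. This is just a quick sanity check; when the venv is active, the printed path should point inside the venv folder you just created:

import sys

# prints the active environment's root; it should end in ocr\venv
print(sys.prefix)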

Installing PyTorch

python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
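
Before going any further, it's worth confirming that PyTorch can actually see the GPU. A quick check (the exact version and device name will depend on your install and hardware):

import torch

# verify the CUDA build of PyTorch is installed and a GPU is visible
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))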

Installing other prerequisites

python -m pip install transformers tiktoken verovio accelerate

Now you can generate a requirements.txt file

python -m pip freeze > requirements.txt

To install the packages listed in requirements.txt (for example, when recreating the environment elsewhere)

python -m pip install -r requirements.txt

Create the Script

Add an image with text in it to the folder and call it sample-image.png, then create a Python script (I'll call it ocr.py, but the name doesn't matter) with the following contents.

from transformers import AutoModel, AutoTokenizer

# load the tokenizer and model from the Hugging Face Hub
# (trust_remote_code is required because GOT-OCR2_0 ships its own model code)
tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'ucaslcl/GOT-OCR2_0',
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map='cuda',
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)

# put the model in inference mode on the GPU
model = model.eval().cuda()

# input your test image
image_file = 'sample-image.png'

# plain texts OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')

print(res)
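
The model card also documents a formatted-output mode for things like tables and formulas. I haven't exercised it beyond the basics, so treat this as a sketch based on the GOT-OCR2_0 usage examples rather than a guarantee:

# formatted texts OCR, per the model card's usage examples
res = model.chat(tokenizer, image_file, ocr_type='format')
print(res)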

To deactivate the virtual environment

deactivate
