Using an ML model for OCR
I've been working on a number of simple problems I want to solve using AI models. One of those requires OCR (Optical Character Recognition), which is the process of converting different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data.
Finding a model
I found a pre-trained model on the Hugging Face model hub called ucaslcl/GOT-OCR2_0. For those interested, the details are on its Hugging Face model page. The goal is a proof-of-concept to see how well it works, and eventually to merge it into a deployable container.
Note: Hugging Face (https://huggingface.co/) is an incredible resource for finding pre-trained models for a variety of tasks. It has a large developer community and offers hosting solutions.
NVIDIA CUDA toolkit
This model requires a GPU. I'm using the NVIDIA CUDA toolkit to set this up. You can download the latest version of the CUDA toolkit from https://developer.nvidia.com/cuda-downloads.
Once that's installed, you can start setting up your Python environment.
Setting up environment
Go to https://developer.nvidia.com/cuda-downloads to download the latest version of the CUDA toolkit.
Go to https://pytorch.org/get-started/locally/ to find the PyTorch install command that matches your platform and CUDA version.
Creating a Python virtual environment
mkdir ocr
cd ocr
py -m venv ./venv
venv\Scripts\activate
You know it's activated when you see (venv) in the terminal.
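If the prompt alone doesn't convince you, you can also ask Python itself. This is a minimal sketch using only the standard library; the function name in_virtualenv is my own and not part of any tool mentioned here.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Run it with python from the same terminal you activated the environment in, otherwise you are checking a different interpreter.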
Installing PyTorch
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
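Before going further, it's worth confirming that PyTorch can actually see the GPU. The sketch below assumes the CUDA build of PyTorch from the command above; cuda_summary is a hypothetical helper name of mine, and it falls back to a plain message when PyTorch is missing or no CUDA device is found.

```python
def cuda_summary():
    """Return a one-line description of what PyTorch can see on this machine."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is available"
    # torch.version.cuda is the CUDA version this PyTorch build was compiled against.
    return f"CUDA {torch.version.cuda} on {torch.cuda.get_device_name(0)}"


print(cuda_summary())
```

If this reports that no CUDA device is available, the usual suspects are a CPU-only PyTorch build or a driver that doesn't match the installed toolkit.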
Installing other prerequisites
python -m pip install transformers tiktoken verovio accelerate
Now you can generate a requirements.txt file
python -m pip freeze > requirements.txt
To install the packages listed in requirements.txt
python -m pip install -r requirements.txt
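To double-check that everything landed in the environment, here's a small sketch that looks up the installed version of each package from the commands above. It only uses importlib.metadata from the standard library; PACKAGES and installed_versions are names I made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# The packages installed by the pip commands above.
PACKAGES = ["torch", "torchvision", "torchaudio",
            "transformers", "tiktoken", "verovio", "accelerate"]


def installed_versions(names):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```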
Create the script
Add an image with text in it to the folder and call it sample-image.png.
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code shipped in the model repository
tokenizer = AutoTokenizer.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True)
model = AutoModel.from_pretrained('ucaslcl/GOT-OCR2_0', trust_remote_code=True, low_cpu_mem_usage=True, device_map='cuda', use_safetensors=True, pad_token_id=tokenizer.eos_token_id)
# Switch to inference mode and make sure the model is on the GPU
model = model.eval().cuda()
# Path to your test image
image_file = 'sample-image.png'
# Plain-text OCR
res = model.chat(tokenizer, image_file, ocr_type='ocr')
print(res)
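Once a single image works, the obvious next step for the proof-of-concept is a folder of images. This is only a sketch: find_images, ocr_folder, and the list of file extensions are my own assumptions, and the OCR call is passed in as a function so the helper can be tried without loading the model.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your images use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_images(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def ocr_folder(folder, run_ocr):
    """Apply `run_ocr` (a function taking a file path string) to every image."""
    return {path.name: run_ocr(str(path)) for path in find_images(folder)}


# With the model and tokenizer from the script above, usage would look like:
# results = ocr_folder(".", lambda f: model.chat(tokenizer, f, ocr_type="ocr"))
```

Passing the OCR call in as a parameter also makes it easy to swap in a different ocr_type or a different model later.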
To deactivate the virtual environment
deactivate