
When the data science team at Johnson & Johnson in India faced the daunting challenge of extracting text from nearly 4 million documents, they had two choices. They could either rely on third-party Optical Character Recognition (OCR) services at a significant cost or develop an in-house solution tailored to their needs.
Venkata Karthik T, senior manager of data science at Johnson & Johnson, shared this journey at MLDS 2025, offering an insightful look into how necessity, innovation, and cost constraints led the team to build a powerful, scalable OCR tool.
“At the end of the day, cost is a very critical factor for any project. So taking care of that is very important,” Karthik explained. Additionally, using third-party solutions raised privacy concerns.
The project, spanning over a year with multiple iterations, demonstrated the power of in-house innovation. “It’s more like we picked up what was really important and started implementing them. 95% of the problem gets solved with these things,” Karthik noted.
Handling sensitive internal documents required greater control over the data. Moreover, building an internal tool offered the advantage of customisation, enabling them to tailor the OCR engine to various use cases beyond document scanning.
The Thought Process Behind Building the Tool In-House
Karthik said that his team evaluated multiple OCR frameworks, prioritising activity metrics, capabilities, and usability. They created datasets for testing, categorising documents into digital, noisy, and handwritten types.
Digital documents were generated using SynthTIGER, noisy documents were sourced from the FUNSD dataset, and handwritten text was taken from the IAM dataset. After rigorous testing, they narrowed their focus to four key OCR models: PaddleOCR, Tesseract, EasyOCR, and HDR Pipeline.
Each framework had its strengths and weaknesses. PaddleOCR excelled at table extraction but struggled with dense text. Tesseract worked well on dense text but had issues with tables. HDR Pipeline performed best for handwritten text.
“So instead of choosing one, we combined PaddleOCR and Tesseract. We took both outputs and saw which one had the highest confidence score. If one model identified text and another missed it, we merged results, improving overall accuracy,” Karthik said.
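In code terms, the merge could look something like the sketch below. It assumes PaddleOCR (2.x API) and pytesseract are installed; the IoU-based box matching and the thresholds are illustrative choices, not the team's exact rules.

```python
# Minimal sketch: merge PaddleOCR and Tesseract word detections by confidence.
# Assumes `pip install paddleocr pytesseract pillow` and the PaddleOCR 2.x API;
# the IoU matching and thresholds are illustrative, not J&J's actual logic.
from paddleocr import PaddleOCR
import pytesseract
from PIL import Image

def paddle_words(image_path, ocr_engine):
    """Return (bbox, text, confidence) tuples from PaddleOCR."""
    words = []
    for page in ocr_engine.ocr(image_path, cls=True):
        for box, (text, conf) in page or []:
            xs = [p[0] for p in box]; ys = [p[1] for p in box]
            words.append(((min(xs), min(ys), max(xs), max(ys)), text, conf))
    return words

def tesseract_words(image_path):
    """Return (bbox, text, confidence) tuples from Tesseract."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    words = []
    for text, conf, x, y, w, h in zip(data["text"], data["conf"],
                                      data["left"], data["top"],
                                      data["width"], data["height"]):
        if text.strip() and float(conf) >= 0:           # skip empty / no-confidence rows
            words.append(((x, y, x + w, y + h), text, float(conf) / 100.0))
    return words

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge(paddle, tess, iou_thr=0.5):
    """Keep the higher-confidence word where boxes overlap; keep unique detections from both."""
    merged, used = [], set()
    for p_box, p_text, p_conf in paddle:
        best_j, best_iou = None, 0.0
        for j, (t_box, _, _) in enumerate(tess):
            score = iou(p_box, t_box)
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= iou_thr:
            used.add(best_j)
            t_box, t_text, t_conf = tess[best_j]
            merged.append((p_text, p_conf) if p_conf >= t_conf else (t_text, t_conf))
        else:
            merged.append((p_text, p_conf))             # only PaddleOCR saw this word
    merged += [(t, c) for j, (_, t, c) in enumerate(tess) if j not in used]
    return merged

if __name__ == "__main__":
    engine = PaddleOCR(use_angle_cls=True, lang="en")
    words = merge(paddle_words("scan.png", engine), tesseract_words("scan.png"))
    print(" ".join(text for text, _ in words))
```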
The existing process was straightforward: send PDFs or images to a third-party service, extract text, and use it for downstream applications. However, the team sought to replicate and improve this with their own OCR pipeline.
The tool was evaluated using word error rate (WER), character error rate (CER), and accuracy. While third-party APIs achieved nearly 98-99% accuracy on digital and noisy documents, the hybrid model significantly improved the performance of the in-house pipeline. HDR Pipeline was particularly effective for handwritten text, achieving 85% accuracy.
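For reference, WER and CER are both edit-distance metrics: the number of word- or character-level insertions, deletions, and substitutions needed to turn the OCR output into the ground truth, divided by the length of the ground truth. A short sketch, with made-up reference and hypothesis strings:

```python
# Word error rate (WER) and character error rate (CER) via Levenshtein distance.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution
    return dp[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(wer("invoice total 1250", "invoice totai 1250"))   # 1 wrong word out of 3
print(cer("invoice total 1250", "invoice totai 1250"))   # 1 wrong character out of 18
```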
How AI Helped with Cost – the Biggest Factor
Cost efficiency was undoubtedly another key consideration. “We don’t want the API up and running all day. It’s an unnecessary cost,” Karthik said. Instead of a costly front-end UI, the team deployed a backend API on Kubernetes. They optimised infrastructure using batch processing, streaming mode, and event-based triggering to ensure the system ran only when needed.
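The shape of that backend is easy to picture, even if the talk did not name a web framework. The sketch below uses FastAPI purely as an assumption to illustrate the "API only, no front end" pattern; the OCR call is a stand-in for the hybrid pipeline described above.

```python
# Minimal sketch of the "backend API, no front end" pattern. FastAPI and the
# endpoint shape are assumptions; the talk does not name the web framework.
# pip install fastapi uvicorn python-multipart
from fastapi import FastAPI, UploadFile

app = FastAPI()

def run_ocr_pipeline(image_bytes: bytes) -> str:
    """Stand-in for the hybrid PaddleOCR + Tesseract pipeline sketched above."""
    return "extracted text goes here"

@app.post("/extract")
async def extract(file: UploadFile):
    # Work happens only when a request (event) arrives, so between batches
    # the deployment can sit idle or be scaled down instead of running all day.
    text = run_ocr_pipeline(await file.read())
    return {"filename": file.filename, "text": text}
```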
To further refine text extraction, the team introduced AI-powered error correction. Using ChatGPT for low-confidence words improved accuracy by 3%. They also experimented with fine-tuned BERT models to correct OCR mistakes without relying on expensive third-party APIs.
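A minimal sketch of that post-correction step is below. The threshold, prompt, and model name are assumptions for illustration; the talk only states that ChatGPT was applied to low-confidence words.

```python
# LLM-based post-correction for low-confidence OCR words (illustrative only).
from openai import OpenAI   # pip install openai

client = OpenAI()           # reads OPENAI_API_KEY from the environment

def correct_low_confidence(words, threshold=0.6, model="gpt-4o-mini"):
    """words: list of (text, confidence) pairs from the merged OCR output."""
    corrected = []
    for i, (text, conf) in enumerate(words):
        if conf >= threshold:
            corrected.append(text)
            continue
        # Give the model local context so it can fix the garbled token.
        context = " ".join(t for t, _ in words[max(0, i - 5): i + 6])
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (f"In the OCR snippet '{context}', the word '{text}' "
                            "was read with low confidence. Reply with only the "
                            "corrected word."),
            }],
        )
        corrected.append(response.choices[0].message.content.strip())
    return corrected
```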
Additionally, they developed six pre-built extraction templates to streamline data retrieval. These templates allowed users to specify areas of interest, such as key-value pairs, structured tables, or spatial relationships within documents. This reduced the need for manual adjustments and sped up adoption within the organisation.
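The talk did not show the template schema, but conceptually such a template maps named fields to regions of the page and an extraction type. The sketch below is hypothetical: every field name and the region format are assumptions made for illustration.

```python
# Hypothetical sketch of a pre-built extraction template; regions are given as
# fractions of the page so the template works across page sizes.
INVOICE_TEMPLATE = {
    "name": "invoice_header",
    "fields": [
        {"type": "key_value", "key": "Invoice No",   "region": (0.0, 0.0, 1.0, 0.25)},
        {"type": "key_value", "key": "Date",         "region": (0.5, 0.0, 1.0, 0.25)},
        {"type": "table",     "label": "line_items", "region": (0.0, 0.25, 1.0, 0.85)},
    ],
}

def apply_template(words, template, page_w, page_h):
    """words: (bbox, text, confidence) tuples; keep only words inside each field's region."""
    extracted = {}
    for field in template["fields"]:
        x1, y1, x2, y2 = field["region"]
        left, top, right, bottom = x1 * page_w, y1 * page_h, x2 * page_w, y2 * page_h
        extracted[field.get("key") or field["label"]] = [
            text for (bx1, by1, bx2, by2), text, _ in words
            if bx1 >= left and by1 >= top and bx2 <= right and by2 <= bottom
        ]
    return extracted
```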
Since many documents contained tabular data, the team leveraged Microsoft’s Table Transformer. This model identified tables and their components, including rows, columns, and headers, before feeding them into PaddleOCR for text extraction.
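A minimal sketch of that detection-then-OCR handoff, using the public Hugging Face Table Transformer checkpoint, is below; the structure-recognition step that labels rows, columns, and headers is omitted for brevity, and the thresholds are illustrative.

```python
# Detect tables with Microsoft's Table Transformer, then OCR each crop with PaddleOCR.
# pip install transformers torch pillow paddleocr
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection
from paddleocr import PaddleOCR

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
detector = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
ocr_engine = PaddleOCR(use_angle_cls=True, lang="en")

image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)

# Convert raw outputs into pixel-space boxes above a confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes)[0]

for box in detections["boxes"]:
    x1, y1, x2, y2 = (int(v) for v in box.tolist())
    image.crop((x1, y1, x2, y2)).save("table_crop.png")
    print(ocr_engine.ocr("table_crop.png", cls=True))   # cell text from PaddleOCR
```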
For barcodes, they used a combination of YOLOv5 for detection and multiple open-source decoders. If these failed, they applied super-resolution techniques to enhance barcode clarity, boosting decoding accuracy to 84%.
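That detect-decode-upscale flow might look roughly like the sketch below. The weight files 'barcode_yolov5.pt' (a fine-tuned barcode detector) and 'EDSR_x4.pb' (OpenCV's EDSR super-resolution model) are placeholders, and pyzbar stands in for the open-source decoders the team combined.

```python
# Barcode path: YOLOv5 detection, pyzbar decoding, super-resolution fallback.
# pip install torch opencv-contrib-python pyzbar pillow numpy
import cv2
import numpy as np
import torch
from PIL import Image
from pyzbar.pyzbar import decode

detector = torch.hub.load("ultralytics/yolov5", "custom", path="barcode_yolov5.pt")

def upscale(crop, factor=4):
    """Super-resolve a barcode crop with OpenCV's EDSR model."""
    sr = cv2.dnn_superres.DnnSuperResImpl_create()
    sr.readModel("EDSR_x4.pb")
    sr.setModel("edsr", factor)
    bgr = cv2.cvtColor(np.array(crop), cv2.COLOR_RGB2BGR)
    return Image.fromarray(cv2.cvtColor(sr.upsample(bgr), cv2.COLOR_BGR2RGB))

def decode_barcodes(image_path):
    image = Image.open(image_path).convert("RGB")
    codes = []
    for x1, y1, x2, y2, conf, cls in detector(image).xyxy[0].tolist():
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        decoded = decode(crop) or decode(upscale(crop))   # retry on an upscaled crop
        codes += [d.data.decode("utf-8") for d in decoded]
    return codes
```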
Despite significant progress, Karthik said that challenges remain with handwritten text, where even humans struggle to decipher poor handwriting. However, the team is optimistic about integrating vision-language models (VLMs) like OCR-free RAG models to bypass traditional OCR altogether.