How Johnson & Johnson India Built an In-House OCR Tool to Scan 4 Million Documents – AIM

by admin / Monday, 10 February 2025 / Published in Uncategorized

When the data science team at Johnson & Johnson in India faced the daunting challenge of extracting text from nearly 4 million documents, they had two choices. They could either rely on third-party Optical Character Recognition (OCR) services at a significant cost or develop an in-house solution tailored to their needs. 

Venkata Karthik T, senior manager of data science at Johnson & Johnson, described the team's journey at MLDS 2025 and explained how necessity, innovation, and cost constraints led them to build a scalable OCR tool.

“At the end of the day, cost is a very critical factor for any project. So taking care of that is very important,” Karthik explained. Additionally, using third-party solutions raised privacy concerns. 

The project, spanning over a year with multiple iterations, demonstrated the power of in-house innovation. “It’s more like we picked up what was really important and started implementing them. 95% of the problem gets solved with these things,” Karthik noted.

Handling sensitive internal documents required greater control over the data. Moreover, building an internal tool offered the advantage of customisation, enabling them to tailor the OCR engine to various use cases beyond document scanning.

Venkata Karthik T speaking at MLDS 2025

The Thought Process Behind Building the Tool In-House

Karthik said that his team evaluated multiple OCR frameworks, prioritising activity metrics, capabilities, and usability. They created datasets for testing, categorising documents into digital, noisy, and handwritten types. 

Digital documents were generated using SynthTIGER, noisy documents were sourced from the FUNSD dataset, and handwritten text was taken from the IAM dataset. After rigorous testing, they narrowed their focus to four key OCR models: PaddleOCR, Tesseract, EasyOCR, and HDR Pipeline. 

Each framework had its strengths and weaknesses. PaddleOCR excelled at table extraction but struggled with dense text. Tesseract worked well on dense text but had issues with tables. HDR Pipeline performed best for handwritten text. 

“So instead of choosing one, we combined PaddleOCR and Tesseract. We took both outputs and saw which one had the highest confidence score. If one model identified text and another missed it, we merged results, improving overall accuracy,” Karthik said. 
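The merging idea Karthik describes can be sketched roughly as follows. This is an illustration only, not the team's implementation: the `Word` record, the box-overlap (IoU) matching rule, and the 0.5 threshold are all assumptions made here, and the inputs are taken to be word lists already extracted from each engine with confidences normalised to the 0-1 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical record for one recognised word (not from the talk)."""
    text: str
    conf: float                              # engine confidence in [0, 1]
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2)


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_words(primary, secondary, iou_threshold=0.5):
    """Merge two engines' outputs.

    Words covering the same region compete on confidence; words found by
    only one engine are kept as they are.
    """
    merged = []
    unmatched = list(secondary)
    for p in primary:
        best = max(unmatched, key=lambda s: iou(p.box, s.box), default=None)
        if best is not None and iou(p.box, best.box) >= iou_threshold:
            unmatched.remove(best)
            merged.append(p if p.conf >= best.conf else best)
        else:
            merged.append(p)
    merged.extend(unmatched)  # text the primary engine missed entirely
    # Approximate reading order: top-to-bottom, then left-to-right.
    return sorted(merged, key=lambda w: (w.box[1], w.box[0]))
```

Matching by box overlap is one of several plausible ways to decide that two engines are looking at the same word; line-level or text-similarity matching would work on the same principle.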

The existing process was straightforward: send PDFs or images to a third-party service, extract text, and use it for downstream applications. However, the team sought to replicate and improve this with their own OCR pipeline. 

The tool was evaluated using word error rate (WER), character error rate (CER), and accuracy. While third-party APIs achieved nearly 98-99% accuracy on digital and noisy documents, the hybrid model significantly improved internal performance. HDR Pipeline was particularly effective for handwritten text, achieving 85% accuracy.
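For readers unfamiliar with the metrics, WER and CER are both edit distances divided by the length of the reference text, counted in words and characters respectively. The sketch below uses the standard Levenshtein definition; the function names are chosen here for illustration, and the talk did not say which tooling the team used to compute the scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character in a four-word sentence costs a full word under WER but only one character under CER, which is why the two numbers are usually reported together.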

How AI Helped with Cost – the Biggest Factor

Cost efficiency was undoubtedly another key consideration. “We don’t want the API up and running all day. It’s an unnecessary cost,” Karthik said. Instead of a costly front-end UI, the team deployed a backend API on Kubernetes. They optimised infrastructure using batch processing, streaming mode, and event-based triggering to ensure the system ran only when needed.

To further refine text extraction, the team introduced AI-powered error correction. Using ChatGPT for low-confidence words improved accuracy by 3%. They also experimented with fine-tuned BERT models to correct OCR mistakes without relying on expensive third-party APIs.
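The selective correction step might look something like the sketch below, which sends only low-confidence words to a corrector and leaves the rest untouched. Everything here is assumed for illustration: the 0.8 threshold, the `(text, confidence)` input shape, and in particular the vocabulary-based corrector, which is a simple stand-in built on Python's `difflib` rather than the ChatGPT or BERT correctors mentioned in the talk.

```python
import difflib


def correct_low_confidence(words, corrector, threshold=0.8):
    """Apply `corrector` only to words whose confidence is below `threshold`.

    `words` is a list of (text, confidence) pairs; `corrector` is any
    callable that maps a word to its corrected form.
    """
    return [text if conf >= threshold else corrector(text)
            for text, conf in words]


def make_vocab_corrector(vocab):
    """Stand-in corrector: snap a word to its closest vocabulary entry."""
    def corrector(word):
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
        return matches[0] if matches else word
    return corrector
```

Gating the corrector on confidence keeps the number of expensive calls small, which fits the cost concerns described above: in this sketch, high-confidence words never reach the corrector at all.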

Additionally, they developed six pre-built extraction templates to streamline data retrieval. These templates allowed users to specify areas of interest, such as key-value pairs, structured tables, or spatial relationships within documents. This reduced the need for manual adjustments and sped up adoption within the organisation.

Since many documents contained tabular data, the team leveraged Microsoft’s Table Transformer. This model identified tables and their components, including rows, columns, and headers, before feeding them into PaddleOCR for text extraction.

For barcodes, they used a combination of YOLOv5 for detection and multiple open-source decoders. If these failed, they applied super-resolution techniques to enhance barcode clarity, boosting decoding accuracy to 84%.
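The fallback logic described here (try each decoder, and only enhance the image if all of them fail) can be outlined as below. This is a structural sketch under assumptions: the decoders and the `enhance` step are passed in as plain callables, standing in for whichever open-source decoders and super-resolution model the team actually used, and the detection step that produces the cropped barcode is omitted.

```python
from typing import Any, Callable, Optional, Sequence

Decoder = Callable[[Any], Optional[str]]


def try_decoders(image: Any, decoders: Sequence[Decoder]) -> Optional[str]:
    """Return the first non-empty result from the decoders, or None."""
    for decode in decoders:
        result = decode(image)
        if result:
            return result
    return None


def decode_barcode(crop: Any,
                   decoders: Sequence[Decoder],
                   enhance: Callable[[Any], Any]) -> Optional[str]:
    """Decode a cropped barcode, enhancing the crop only as a last resort."""
    result = try_decoders(crop, decoders)
    if result is not None:
        return result
    # Every decoder failed on the raw crop: enhance once and retry.
    return try_decoders(enhance(crop), decoders)
```

Running the enhancement only after the cheaper decoders have failed keeps the more expensive step off the common path.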

Despite significant progress, Karthik said that challenges remain with handwritten text, where even humans struggle to decipher poor handwriting. However, the team is optimistic about integrating vision-language models (VLMs) like OCR-free RAG models to bypass traditional OCR altogether.
