Central Knowledge Base, OCR, and Semantic Search: The Foundation of Technological Change

Dec. 19, 2025, 1:49 p.m.
Jakub Smeda
Author

Companies operating in the water and wastewater sector, installation contractors, and turnkey integrators handle hundreds of contracts, each with its own tender, execution, as-built, and warranty documentation. In a mid-sized organization running dozens of projects annually, the number of documents easily reaches tens of thousands of files. Data is scattered across emails, Excel spreadsheets, local drives, archives, intranets, and isolated tools. In practice, this means there is no single source of truth.

During an analysis conducted by experts from Z3X Tech-Care Group (AI specialists working alongside Mr. Piotr Gizowski, a certified civil engineer specializing in the gas sector), employees pointed out that the greatest loss of knowledge and the highest number of errors occur at project handover stages: from sales to execution, and later from execution to service. This is where information about arrangements, commercial terms, warranty clauses, and technical details is lost. Departments work in task-based silos without sharing full context. This is not a technological problem but an operational one. Technology merely exposes it.

What the document problem looks like in numbers

Let us consider a service and manufacturing company operating in the water and wastewater or installation sector, employing around one hundred office workers, designers, cost estimators, and service engineers. Each of them spends between ten and fifteen minutes per day trying to find the correct version of a file, an email with information sent in the past, or a document whose name they remember but whose location they do not.

On an annual basis, this amounts to between 2,500 and 3,500 working hours. At an internal hourly rate of PLN 150, this translates into roughly PLN 375,000 to PLN 525,000 in lost productivity. On top of that come the costs of errors resulting from working with outdated templates and the risk of ordering materials or services based on incorrect data. In one of the analyzed cases, merging data from two sources led to duplicated line items, resulting in more than PLN 50,000 in unnecessary purchases.
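The arithmetic behind this estimate can be sketched in a few lines of Python. This is a minimal illustration, not the model used in the analysis: the function names are hypothetical, and the number of working days per year is an assumption, since the article does not state the value it used.

```python
def annual_search_hours(employees: int, minutes_per_day: float, working_days: int) -> float:
    """Hours per year spent searching for documents, summed over all employees.

    working_days is an assumed input; the article does not specify it.
    """
    return employees * minutes_per_day / 60 * working_days


def lost_productivity_pln(hours: float, hourly_rate_pln: float) -> float:
    """Cost of the lost hours at the internal hourly rate."""
    return hours * hourly_rate_pln


# The article's own range: 2,500 to 3,500 hours at PLN 150 per hour.
low = lost_productivity_pln(2_500, 150)
high = lost_productivity_pln(3_500, 150)
print(f"PLN {low:,.0f} to PLN {high:,.0f}")
```

Keeping the hour estimate and the cost conversion separate makes it easy to rerun the calculation with a company's own headcount, search time, and hourly rate.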

However, time is not the greatest cost. Risk is. A document is not just a file. In technical industries, it is an element of project and contractual value.

Why documentation in these industries is not an archive, but an operational product

A document is created during the tendering phase, changes during negotiations, becomes an instruction for execution, serves as evidence during acceptance, and forms the basis for warranty claims. This means that a document cannot simply be closed and “sent to the archive.” Its content changes function over time, and each subsequent version must remain accessible and controlled.

The lack of a central repository results in:

  • loss of knowledge when an employee leaves
  • delays in tendering and service
  • low predictability of information quality
  • legal risk in the event of audits, claims, or disputes with investors
  • no feedback loop between execution and tendering

Most critically, errors do not surface immediately. They emerge months later, most often during the warranty period.

Why OCR is critical and how a scanned PDF differs from a document understood by AI

Most documents in water, wastewater, and construction companies exist as PDF files, but many of them lack a text layer. For an IT system, a scanned PDF is an image. It cannot be searched for a serial number, component name, warranty expiry date, or scope of responsibility. An employee must open the document, scroll through it, read it, or manually copy the content.

OCR, or optical character recognition, transforms an image into searchable and analyzable text. The system detects content, recognizes characters, reads numbers, names, and technical parameters. This creates a text layer that can be indexed and analyzed semantically. AI can then answer questions, find fragments, point to sentences, compare values, or search hundreds of documents for a single contractual requirement.
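To show what "a text layer that can be indexed" means in practice, here is a minimal sketch of an inverted index over OCR output. It performs exact keyword matching only, which is a deliberately simpler stand-in for semantic search; a semantic engine would typically index embeddings of the same text. The file names, page contents, and function names are hypothetical.

```python
import re
from collections import defaultdict

# Tokens may contain hyphens and slashes so that serial numbers and dates stay whole.
TOKEN = re.compile(r"\w[\w\-/]*")


def build_index(pages: dict[tuple[str, int], str]) -> dict[str, set[tuple[str, int]]]:
    """Map each lowercased token to the (document, page) pairs that contain it."""
    index: dict[str, set[tuple[str, int]]] = defaultdict(set)
    for location, text in pages.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(location)
    return index


def search(index: dict[str, set[tuple[str, int]]], query: str) -> set[tuple[str, int]]:
    """Return the locations that contain every token of the query."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


# Hypothetical OCR output, keyed by (file name, page number).
pages = {
    ("pump_spec.pdf", 3): "Serial number SN-48213, warranty expires 2027-06-30",
    ("contract.pdf", 12): "Warranty period: 36 months from acceptance",
}
index = build_index(pages)
print(search(index, "SN-48213"))
print(search(index, "warranty"))
```

Without the text layer, neither query could be answered without opening each file by hand.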

In these industries, OCR is not a simple letter-recognition task. Documents include diagrams, drawings, tables, handwritten notes, stamps, and signatures. Scan quality directly affects recognition quality. If a document is dirty, skewed, or obscured, the system marks the affected fields as uncertain. This introduces another key benefit of implementation: the system does not guess. Instead of misinterpreting data, it generates a list of specific problematic fields and routes them to a specialist.

For example, in a pump’s technical specification, the system recognizes the name but cannot read the serial number due to poor print quality. This parameter may affect spare-part compatibility or warranty conditions. The system creates a task, indicates the page and area it could not read, and the employee can enter the missing value in under a minute. Instead of reviewing a document dozens of pages long, the specialist receives only what requires attention.
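The routing described in this example can be sketched as follows. This is an illustration under stated assumptions, not the implemented system: the `OcrField` and `ReviewTask` structures, the confidence scale from 0.0 to 1.0, and the 0.90 threshold are all hypothetical, and real OCR engines report confidence in their own formats.

```python
from dataclasses import dataclass


@dataclass
class OcrField:
    name: str                        # e.g. "serial_number"
    value: str                       # text produced by the OCR engine
    confidence: float                # assumed scale: 0.0 (unreadable) to 1.0 (certain)
    page: int
    bbox: tuple[int, int, int, int]  # x, y, width, height of the area on the page


@dataclass
class ReviewTask:
    document: str
    field_name: str
    page: int
    bbox: tuple[int, int, int, int]
    suggested_value: str


def route_fields(document: str, fields: list[OcrField], threshold: float = 0.90):
    """Accept confident fields; turn uncertain or empty ones into review tasks."""
    accepted: dict[str, str] = {}
    tasks: list[ReviewTask] = []
    for field in fields:
        if field.confidence >= threshold and field.value.strip():
            accepted[field.name] = field.value
        else:
            tasks.append(
                ReviewTask(document, field.name, field.page, field.bbox, field.value)
            )
    return accepted, tasks


# Hypothetical pump specification: the name is legible, the serial number is not.
fields = [
    OcrField("pump_name", "Example Pump X", 0.98, 4, (120, 80, 300, 24)),
    OcrField("serial_number", "SN-4?21", 0.41, 4, (120, 140, 220, 24)),
]
accepted, tasks = route_fields("pump_spec.pdf", fields)
for task in tasks:
    print(f"Review {task.field_name} on page {task.page}, area {task.bbox}")
```

The specialist receives one task pointing at a page and area, while the confidently recognized name goes straight into the index.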

This human–tool collaboration is more effective than attempting full automation. The system performs most of the work, while humans complete the missing five to ten percent of data. Combining these two elements delivers the highest reliability of the content index.

OCR is not used for archiving. It is a mechanism for recovering knowledge from paper documents that already exist and have been used by the company for years. In the context of later ticketing, warranty digitization, and bid analysis initiatives, it turns historical projects into an operational asset that supports future contracts.

Proof of Concept (PoC)

We completed an initial pilot to justify a full implementation: a Proof of Concept (PoC). It combined:

  • a central knowledge base
  • OCR
  • semantic search

The objective was to verify whether digitizing and tagging approximately 2,000 documents from a single pilot project and applying semantic indexing would reduce information retrieval time to under two minutes in more than 90 percent of cases. The PoC included a SharePoint-based document repository, taxonomy, metadata, versioning, and a prototype semantic search engine.
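The acceptance criterion stated above (retrieval in under two minutes in more than 90 percent of cases) can be expressed as a simple check. This is a minimal sketch; the function name and the sample timings are hypothetical and are not measurements from the pilot.

```python
def meets_target(
    retrieval_seconds: list[float],
    limit_seconds: float = 120.0,
    required_share: float = 0.90,
) -> bool:
    """True if more than required_share of searches finished under limit_seconds."""
    if not retrieval_seconds:
        return False  # no measurements means the target cannot be confirmed
    within_limit = sum(1 for seconds in retrieval_seconds if seconds < limit_seconds)
    return within_limit / len(retrieval_seconds) > required_share


# Hypothetical timings in seconds, for illustration only.
sample = [35, 48, 62, 20, 95, 110, 41, 77, 30, 54,
          66, 25, 88, 102, 47, 59, 33, 71, 19, 240]
print(meets_target(sample))
```

Writing the criterion down this way forces the two thresholds to be explicit before the pilot starts, so the result cannot be reinterpreted afterwards.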

The pilot covered approximately 30 percent of the target functionality. We did not implement full document approval workflows or confidentiality-based access control, and the chatbot available in the Teams interface was deferred to the full project phase. The pilot focused on what mattered most: whether documents could be found quickly and accurately, and whether users would adopt the tool.

Results

Within this PoC:

  • we significantly reduced content search time on a limited sample
  • we organized document versioning, and usage log analysis allowed us to prepare a content gap map for the full project migration
  • we confirmed the necessity of defining taxonomy and publication templates, as without them users would introduce documents inconsistently
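The last point, that documents arrive inconsistently without a taxonomy and publication templates, can be illustrated with a small metadata check run at publication time. The field names and allowed values below are assumptions for illustration; the document types simply mirror the tender, execution, as-built, and warranty documentation mentioned earlier, and the actual PoC taxonomy is not described in this article.

```python
# Hypothetical taxonomy and required fields.
ALLOWED_TYPES = {"tender", "execution", "as-built", "warranty"}
REQUIRED_FIELDS = ("project_id", "doc_type", "version", "owner")


def validate_metadata(meta: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the document may be published."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not meta.get(name)
    ]
    doc_type = meta.get("doc_type")
    if doc_type and doc_type not in ALLOWED_TYPES:
        problems.append(f"unknown doc_type: {doc_type}")
    return problems


print(validate_metadata(
    {"project_id": "P-001", "doc_type": "warranty", "version": "1.0", "owner": "service"}
))
print(validate_metadata({"doc_type": "memo"}))
```

Rejecting incomplete metadata at upload is cheaper than cleaning up thousands of inconsistently tagged files later.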

However, the most important outcome was not the metrics themselves. The PoC prepared users for change. Tool adoption was the key evaluation criterion of the pilot. Technology does not determine success; the people using it do.

Risks and lessons learned

The biggest challenge was OCR quality for technical scans. Tool comparisons led to selecting a hybrid solution. Another risk was excessive dependence on one or two key employees for document tagging. The solution was to prepare a short “How to publish” guide and ten-minute micro-trainings. It was also essential to introduce anonymization rules so that documents containing personal data could be used in the PoC without compliance risks.

The pilot also revealed that employees want to use AI tools but hesitate due to regulatory concerns. Introducing policies and checklists proved to be a necessary component of further transformation.

Full project: an investment with a predictable return

The full project includes:

  • digitization of all documents
  • full metadata and versioning
  • approval workflows and GDPR compliance
  • integration with ticketing and CRM/ERP
  • role- and confidentiality-based access control
  • chatbot and natural language search
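Of the items above, role- and confidentiality-based access control is the easiest to show in code. The sketch below is one possible shape for such a rule, assuming three confidentiality levels and a per-role policy; the roles, levels, and policy table are hypothetical and do not describe the planned implementation.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    INTERNAL = 1
    RESTRICTED = 2
    CONFIDENTIAL = 3


# Hypothetical policy: the document types a role may open, and up to which level.
POLICY = {
    "sales": ({"tender", "warranty"}, Confidentiality.CONFIDENTIAL),
    "execution": ({"tender", "execution", "as-built"}, Confidentiality.RESTRICTED),
    "service": ({"as-built", "warranty"}, Confidentiality.RESTRICTED),
}


def can_read(role: str, doc_type: str, level: Confidentiality) -> bool:
    """Allow access only if the role covers the type and is cleared for the level."""
    if role not in POLICY:
        return False  # unknown roles are denied by default
    allowed_types, clearance = POLICY[role]
    return doc_type in allowed_types and level <= clearance


print(can_read("service", "warranty", Confidentiality.INTERNAL))
print(can_read("service", "tender", Confidentiality.INTERNAL))
print(can_read("execution", "as-built", Confidentiality.CONFIDENTIAL))
```

Both conditions must hold, so a role with high clearance still cannot open document types outside its scope, and search results can be filtered through the same function.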

The ROI of the full project was estimated at 180–200 percent over 18–24 months, aligning with the expected outcomes defined in the transformation roadmap.

Why this must be the number one transformation project

The central knowledge base is the foundation for all subsequent PoCs, including ticketing, warranty digitization, bid automation, and deployment of a local AI model. Without a shared repository, version control, and access management, subsequent projects cannot be implemented without significant risk.

A sample ticketing system, discussed in another article, requires knowledge to accelerate diagnostics. Bid automation requires historical documents and data. An on-premise LLM requires a trusted content source for analysis. Sales and marketing dashboards require a single data language.

Summary

A central knowledge base combined with OCR and semantic search is not a documentation project. It is a risk management, cost management, and organizational know-how initiative. It is the foundation of technological change. It is also a decision about whether an organization retains control over its knowledge or allows it to leave with each departing specialist. It is the first step that determines the success of all subsequent ones.
