Imagine the following situation: You work in the operations team of a medium-sized company. Every day, your team processes order forms from different B2B customers. All of them arrive as PDFs, and in theory they all contain the same information: customer ID, purchase order number, delivery date, and the ordered items.
In practice, however, every document looks slightly different: One customer places the purchase order number in the top-left corner, the next one in the bottom-right corner. Some write “PO Number”, others use “Order ID”, “Order Reference”, or something completely different.
For us humans, this is usually not a problem. We look at the document, understand the context, and immediately recognize which information is meant.
For traditional automation systems, however, this becomes difficult: A regex rule can specifically search for “PO Number: “. But what happens if the next customer uses “Order Reference: “ instead?
That is exactly the problem I recreated for this article.
We compare two different approaches for extracting structured data from B2B order forms:
- A traditional rule-based approach using pytesseract and regex rules
- An LLM-based approach using pytesseract, Ollama, and LLaMA 3
The goal of this article is not to show that LLMs are generally better. They are not always.
A much more interesting question is: At what point do traditional extraction pipelines start to reach their limits as complexity and the number of different layouts increase? And when can an LLM actually reduce maintenance effort?
Table of Contents
1 – Step-by-Step Guide
2 – Head-to-Head Comparison
3 – When should we NOT use an LLM?
4 – Final Thoughts
Where to Continue Learning?
1 – Step-by-Step Guide
We rebuild both approaches step by step. First, we create two sample PDFs containing the same business information but using different layouts. Afterwards, we extract the data once with a traditional OCR and regex pipeline and once with an OCR and LLM pipeline. This allows us to compare both approaches under identical conditions.
- The traditional approach basically asks: “Can I find the exact pattern that I programmed?”
- The LLM-based approach instead asks: “Can I understand the meaning of this field in context?”
→ 🤓 Find the full code in the GitHub Repo 🤓 ←
Before We Start — Mise en Place
pip vs. Anaconda
In this guide, we use pip, Python’s standard package manager. This means we install all libraries directly through the command line using pip install …. pip is already included automatically when you install Python. If you know Python tutorials that work with Anaconda, that is simply another way to achieve the same goal (using conda install …). In the article “Python Data Analysis Ecosystem — A Beginner’s Roadmap”, you can find further details about getting started with Python. Additionally, on a Windows device we use the CMD terminal (Windows key + R > type cmd > Enter).
Create and activate a new virtual environment
Create a new Python environment with python -m venv b2bdocumentextractor (you can change the name) in a terminal and activate it with b2bdocumentextractor\Scripts\activate.
Optional: Check Python and pip
python --version
pip --version
You should see a Python and a pip version.
Step 1 – Install Tesseract
Tesseract is the OCR engine. It is the tool that actually reads text from images or scanned PDFs using OCR (Optical Character Recognition). pytesseract is only the Python bridge to Tesseract. This means: Our Python code can communicate with Tesseract through pytesseract, but the real text recognition is done by Tesseract itself. Without installing Tesseract first, pytesseract cannot work.
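If pytesseract later reports that it cannot find the Tesseract binary, you can point it there explicitly in Python. A minimal configuration sketch, assuming the default installation path shown below (adjust it if you chose a different location):

```python
import pytesseract

# Tell pytesseract where the Tesseract executable lives on Windows.
# This is only needed if Tesseract is not on your PATH.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```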
First, we download the latest .exe-file for w64 and run the installer:
GitHub – Tesseract at UB Mannheim
Important: Remember the installation path:
C:\Program Files\Tesseract-OCR
Inside the CMD terminal, we verify the installation using the following command:
"C:\Program Files\Tesseract-OCR\tesseract.exe" --version
If everything worked correctly, we should see the corresponding Tesseract version.
Step 2 – Install Poppler
Next, we install Poppler. pdf2image, our library for converting PDFs into images (installed in Step 3), requires Poppler in the background. Poppler is an open-source PDF rendering library used to display PDF files.
For this, we download the latest version of Poppler, extract the ZIP file, and move the extracted folder to the C: drive.
GitHub-Poppler Windows Releases
Inside the extracted folder, navigate to Library > bin and copy the full path to that folder on your C: drive. On my machine, it looks like this:
C:\Users\schue\poppler-26.02.0\Library\bin
Additionally, we add the path to the PATH variable so Windows knows where Poppler is located.
Hint for Newbies:
Press the Windows key and search for Edit environment variables. Afterwards click on Edit the system environment variables. Then click on Environment Variables. Under User variables, select the variable PATH, click on Edit, then New, and paste the path.
Now restart CMD so the changes are applied.

Step 3 – Install Python Libraries
Now we install all Python libraries we need. Make sure you reactivate the Python environment beforehand:
pip install pytesseract pdf2image pillow fpdf2 ollama
- pytesseract: We install this library as the bridge between Python and Tesseract. We already installed Tesseract as the OCR engine, but only with pytesseract can Python communicate with it directly.
- pdf2image: Tesseract is an OCR engine, which means it recognizes text from pixels in an image. It cannot read PDF structures directly. pdf2image therefore performs an intermediate step: It renders each PDF page as an image, similar to a screenshot, so that pytesseract can analyze it afterwards. Note: If we had digital PDFs (meaning PDFs where you can select and copy text), we could directly extract the text using libraries such as pdfplumber or PyMuPDF. However, since we assume that B2B order forms are often scans in practice, we take the detour through pdf2image.
- pillow: pdf2image and pytesseract use this image-processing library in the background (we do not directly see the usage in the code) to correctly process images.
- fpdf2: We use this library to automatically generate two test PDFs (Layout A and Layout B) via script for the article example.
- ollama: This library allows our Python script to send messages to the LLM and receive responses.

Step 4 – Install Ollama and Download LLaMA 3
Once the installation of the libraries worked successfully, we install Ollama and LLaMA 3 as the LLM. Ollama is the tool that allows us to run LLMs completely free, locally on our laptop, and without API keys.
First, we install Ollama. If you have not already done this, you can download the Windows installer from Ollama and execute it.
Afterwards, we download LLaMA 3 using the following command:
ollama pull llama3
Depending on your internet connection, this step may take some time since approximately 4.7 GB are downloaded. However, we can see a progress bar in the terminal.

Afterwards, we verify whether everything worked:
ollama list
If you see something similar to the screenshot, it worked successfully.

Step 5 – Create the Project Folder and Generate Test PDFs
For this comparison, we create two B2B order forms for Alpha GmbH and Beta AG that contain the same information but use different layouts. In this example, we assume that the order forms are scans, which is why we previously installed pdf2image (for digital PDFs, this would also be possible with libraries such as pdfplumber or PyMuPDF).
First, we create a project folder to store all files there:
mkdir document_extractor
cd document_extractor
Next, we create a new file called create_test_pdfs.py and insert the following code that you can find in this GitHub-Gist. We save this file inside the previously created folder document_extractor:
https://gist.github.com/Sari95/a52a62eb78e0604c4d8c64f5cdd1160a
Now we return to the terminal and execute the file:
python create_test_pdfs.py
Inside the folder, we can now see the two newly created PDFs:

In the two PDFs, we can already see the problem:
- They contain the same information.
- But the PDFs use completely different field names and a different date format.
Approach 1: The Traditional Way (pytesseract + Regex Rules)
The traditional approach works in two steps:
- First, we convert the PDF into an image. Afterwards, we use pytesseract to read the image and extract the raw text via OCR (Optical Character Recognition). Put simply, OCR means that the tool “looks” at the image and tries to recognize letters from pixels. Quite similar to how humans decipher handwritten notes.
- In the second step, we use regex. These are regular expressions that search for specific patterns inside the text. For example, we can define: “Search for everything that comes after PO Number:.”
Already in this second step, we can identify the first problem: What happens if the customer simply writes “Order Reference” instead of “PO Number: “?
In that case, the regex pattern finds nothing. What we can then do (or must do) is add a new rule.
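The core of such a rule-based extractor can be sketched in a few lines. The field names, patterns, and sample texts below are illustrative assumptions, not the exact rules from the Gist:

```python
import re

# Illustrative regex rules written for Layout A.
# Every new layout would need its own set of patterns.
PATTERNS = {
    "customer_id": r"Customer ID:\s*(\S+)",
    "po_number": r"PO Number:\s*(\S+)",
    "delivery_date": r"Delivery Date:\s*([\d./-]+)",
}

def extract_fields(text: str) -> dict:
    """Apply each regex to the OCR text; unmatched fields become None."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result

# Layout A uses exactly the labels the rules expect ...
layout_a = "Customer ID: C-1001\nPO Number: PO-2024-001\nDelivery Date: 2024-03-15"
# ... while Layout B uses different labels, so nothing matches.
layout_b = "Client No.: C-1001\nOrder Reference: PO-2024-001\nShip by: 15.03.2024"

print(extract_fields(layout_a))  # all three fields found
print(extract_fields(layout_b))  # every value is None
```

In a real pipeline, the `text` would come from pdf2image plus pytesseract instead of a hard-coded string, but the failure mode is the same: the rules only match what they were written for.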
Execute Script 1 for Approach 1
Next, we create a new file called approach1_traditional.py with the following code that you can find in the GitHub-Gist inside the same folder:
https://gist.github.com/Sari95/aa2be6938fbcb1c7f94b053d9046f55d
Now we execute the file again inside the terminal:
python approach1_traditional.py
The Result of Approach 1
For Layout A, everything works perfectly:
For Layout B? Not a single field is recognized and all values return “None”:

And this is exactly where the problem lies. For every new customer, new regex rules would have to be written, tested, and deployed. With 200 customers, that means 200 different patterns. And every time a customer slightly changes their form, the system breaks again.
Approach 2: A New Way (pytesseract + Ollama + LLaMA 3)
In this second approach, we keep the OCR step, but replace the rigid regex rules with an LLM:
- pytesseract still reads the text from the PDF.
- Instead of telling the code “Search for PO Number: ”, we tell the LLM: “Here is an order document. Extract these fields for me, regardless of how they are named.”
The LLM understands the semantic context. It recognizes that “Order Reference” and “PO Number” mean the same thing, even without an explicit rule.
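In code, that extraction step might look roughly like the sketch below. The prompt wording and field names are illustrative assumptions (the real script is in the Gist); the Ollama call is kept inside its own function so the prompt building and response parsing stay separately testable:

```python
import json

FIELDS = ["customer_id", "po_number", "delivery_date"]

def build_prompt(ocr_text: str) -> str:
    """Ask the model for a fixed JSON schema, regardless of the field labels."""
    return (
        "Here is an order document. Extract the following fields as JSON "
        f"with exactly these keys: {', '.join(FIELDS)}. "
        "The labels in the document may differ (e.g. 'Order Reference' "
        "instead of 'PO Number'). Return ONLY the JSON object.\n\n"
        + ocr_text
    )

def parse_response(raw: str) -> dict:
    """LLMs sometimes wrap JSON in extra text; extract the first {...} block."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model response")
    return json.loads(raw[start : end + 1])

def extract_with_llm(ocr_text: str, model: str = "llama3") -> dict:
    import ollama  # lazy import: requires the Ollama server running locally
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": build_prompt(ocr_text)}],
    )
    return parse_response(response["message"]["content"])
```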
Execute Script 2 for Approach 2
Now, we create a new file called approach2_llm.py with the following code that you can find in the GitHub-Gist inside the same folder:
https://gist.github.com/Sari95/d4e9e83490a9fbf34a3776d1604f8742
Now we execute the file again inside the terminal. Make sure that Ollama is still running in the background:
python approach2_llm.py
The Result of Approach 2
What we can now see is that both layouts are correctly recognized:

For both layouts, the information from the differently named fields is correctly extracted and assigned, even though not a single regex was adjusted and no new template was created. The LLM understands both layouts because it reads the context. Additionally, the date format from Layout B is directly normalized to match the format from Layout A.
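Because LLM output is probabilistic, it is good practice to validate it before trusting it downstream. A small sketch of such a check for the delivery date (the accepted input formats are an assumption; extend them as needed):

```python
from datetime import datetime

# Date formats we might expect across different customer layouts.
KNOWN_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y"]

def normalize_date(value: str) -> str:
    """Return the date in ISO format (YYYY-MM-DD) or raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print(normalize_date("15.03.2024"))  # → 2024-03-15
```

A deterministic check like this catches the cases where the model returns a date in an unexpected shape, instead of silently passing it on.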
2 – Head-to-Head Comparison
After both tests, one thing quickly becomes clear: Technically, both approaches solve the same problem.
Both approaches have their own advantages and disadvantages:

With regex-based pipelines, the complexity lives in the rules and maintenance effort. With LLM-based pipelines, the complexity shifts toward infrastructure, inference time, and model behavior. For medium-sized companies processing many customer-specific layouts, that trade-off can become strategically more important than pure extraction accuracy.
3 – When should we NOT use an LLM?
At the moment, it often feels as if every existing automation process suddenly needs to be replaced with AI or LLMs.
In practice, however, this is not always the better solution. Especially medium-sized companies usually do not need to build the “most modern” solution, but rather the one that remains stable, maintainable, and economically reasonable in the long term. Depending on the situation, that can be the traditional regex-based approach, while in other cases switching to an LLM may make more sense.
Some situations where the traditional approach may still be the more suitable option:
- The documents are stable and standardized:
If a company only processes a few known layouts and these rarely change, regex is often the better solution. Why?
Because the additional benefit of an LLM becomes small, while the overall system complexity increases.
A stable rule-based process, on the other hand, is faster, cheaper, easier to debug, and easier to hand over to new people.
- Speed and throughput are critical:
In our example, the LLM processes one document within 20–40 seconds. At first, that sounds acceptable. But once we imagine ourselves inside a real production environment, the perspective changes quickly.
A medium-sized company probably processes orders, delivery notes, invoices, customs documents, support documents, etc. And not 10 times per day, but 10,000 times per day.
In this situation, inference time suddenly becomes a real infrastructure issue. Regex-based systems run significantly faster, whereas LLMs require more RAM, more CPU/GPU power, and often additional queueing or batch-processing mechanisms.
- Explainability is more important than flexibility:
Especially in regulated industries such as pharma, insurance, banking, or healthcare, it is often necessary to fully understand why a specific value was extracted. Regex rules are clearly deterministic: One line of code produces one clearly explainable result. LLMs, on the other hand, work probabilistically: The model interprets the context and returns the most likely result. This is exactly what makes LLMs flexible, but at the same time also more difficult to audit.
- The company does not have the right infrastructure:
In our example, we used Ollama. Getting started was generally simple. Nevertheless, it should not be underestimated that memory consumption, GPU resources, monitoring, or response times under load can look very different when working with LLMs.
On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech — made for curious minds like yours.
Have a look and subscribe on Medium or on Substack if you want to stay in the loop.
4 – Final Thoughts
Choosing the right approach is not necessarily a technical question, but rather a strategic one.
The traditional approach tries to explicitly describe every possible document. The LLM-based approach instead tries to understand meaning and context. For small and stable environments, the traditional approach is often completely sufficient. The more layouts and edge cases appear, the more difficult it becomes to keep the rules maintainable in the long term. That is exactly where LLMs start to become interesting.
It can also be an exciting entry-level use case for a company to start working with an LLM here and, in doing so, make the company ready for AI and gain initial practical experience.

