Inside the Engine: How We Teach a Computer to Spot PII
A clear guide on the different methods used to detect and redact personally identifiable information in documents.
3 min read
2025-05-01

Most people think finding personally identifiable information (PII) is just a fancy form of find-and-replace. That works for phone numbers that follow standard formats, but it breaks the moment a document contains an unusual format—or a brand-new label the rules don't know yet. We wanted a detector that adapts to context, processes documents efficiently, and provides accurate results. Here's how our engine works.
The engine is designed to handle documents of any size while maintaining high accuracy.
Humans spot sensitive information because of surrounding words—"John's social security number is 123-45-6789". The numeric pattern alone could be anything; the phrase "social security number" provides the critical context. Our system analyzes both the text patterns and their surrounding context to accurately identify PII.
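To make the role of context concrete, here is a minimal Python sketch. It is not the engine's actual implementation: it stands in a regular expression plus a keyword check for whatever the real detector does. The pattern, the keyword list, the window size, and the two confidence values are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern and context keywords, for illustration only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SSN_CONTEXT = ("social security", "ssn")


def score_ssn_candidates(text, window=40):
    """Return (match_text, confidence) pairs for SSN-shaped numbers.

    A candidate scores higher when a context keyword appears within
    `window` characters on either side of the match.
    """
    results = []
    for m in SSN_PATTERN.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        nearby = text[start:end].lower()
        has_context = any(keyword in nearby for keyword in SSN_CONTEXT)
        # Made-up scores: the pattern alone is weak evidence,
        # pattern plus context is strong evidence.
        confidence = 0.9 if has_context else 0.4
        results.append((m.group(), confidence))
    return results


with_context = score_ssn_candidates("John's social security number is 123-45-6789")
without_context = score_ssn_candidates("Order reference 123-45-6789 shipped today")
```

The same digits get a high score in the first sentence and a low one in the second, which is the behavior the paragraph above describes.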
The power lies in customization: you can specify exactly which types of PII you want to detect, from standard categories like emails and phone numbers to domain-specific identifiers like patient IDs or employee numbers.
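As a rough sketch of what that customization could look like, the Python below builds a detector from a list of requested categories plus optional custom patterns. The function name `build_detector`, the built-in patterns, and the `patient_id` format are hypothetical and do not describe the product's real API.

```python
import re

# Hypothetical built-in patterns; a real detector would be far more thorough.
DEFAULT_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "phone": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
}


def build_detector(categories, custom_patterns=None):
    """Return a detect(text) function limited to the requested categories."""
    patterns = {**DEFAULT_PATTERNS, **(custom_patterns or {})}
    unknown = [c for c in categories if c not in patterns]
    if unknown:
        raise ValueError(f"No pattern for categories: {unknown}")
    compiled = {c: re.compile(patterns[c]) for c in categories}

    def detect(text):
        found = []
        for category, regex in compiled.items():
            for m in regex.finditer(text):
                found.append((category, m.group(), m.start(), m.end()))
        # Report detections in document order.
        return sorted(found, key=lambda item: item[2])

    return detect


# Ask only for emails and a domain-specific patient ID (assumed format).
detect = build_detector(
    ["email", "patient_id"],
    custom_patterns={"patient_id": r"\bPatient #[A-Z]\d{5}\b"},
)
hits = detect("Patient #A12345 wrote from john.doe@example.com, call 555-123-4567")
```

Because `phone` was not requested, the phone number in the sample text is left alone, while the custom `patient_id` pattern is picked up alongside the built-in email one.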
You experience the engine's performance optimizations as documents that process in seconds rather than minutes.
| Text Example | Category Detected | Confidence | Action |
| :---------------------------------------------- | :---------------- | :--------: | :------------ |
| john.doe@example.com | email | 0.98 | ⬛ redact |
| 123-45-6789 | ssn | 0.99 | ⬛ redact |
| "Patient #A12345" | patient_id | 0.95 | ⬛ redact |
| "DOB: 01/15/1980" | date_of_birth | 0.97 | ⬛ redact |
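
The step from a confidence score to the Action column can be sketched in a few lines of Python. This is an illustrative assumption, not the engine's code: the `Detection` record, the 0.9 threshold, and the mask character are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    start: int         # index of the first character of the span
    end: int           # index one past the last character
    category: str
    confidence: float


def redact(text, detections, threshold=0.9, mask="█"):
    """Mask every detection at or above the threshold; leave the rest untouched."""
    pieces = []
    cursor = 0
    for d in sorted(detections, key=lambda d: d.start):
        if d.confidence < threshold or d.start < cursor:
            continue  # skip low-confidence spans and spans overlapping a masked one
        pieces.append(text[cursor:d.start])
        pieces.append(mask * (d.end - d.start))
        cursor = d.end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Email john.doe@example.com or ref 123-45-6789"
email_start = sample.index("john.doe@example.com")
ref_start = sample.index("123-45-6789")
detections = [
    Detection(email_start, email_start + len("john.doe@example.com"), "email", 0.98),
    # Deliberately low score to show a span that stays visible.
    Detection(ref_start, ref_start + len("123-45-6789"), "ssn", 0.50),
]
cleaned = redact(sample, detections)
```

Masking with one character per original character keeps the document's layout intact; a fixed-length placeholder would be a reasonable alternative if span length itself is considered sensitive.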
Our system combines high precision with excellent recall, ensuring sensitive information is properly identified.
Curious to see it in action? Try the live demo and watch an ordinary PDF turn squeaky-clean in under ten seconds.
© Copyright 2025 DataFog. All rights reserved.