Inside the Engine: How We Teach a Computer to Spot PII

A clear guide on the different methods used to detect and redact personally identifiable information in documents.

3 min read • 2025-05-01


Most people think finding personally identifiable information (PII) is just a fancy form of find-and-replace. Pattern matching works for phone numbers that follow standard formats, but it breaks the moment a document contains an unusual format or a brand-new label the rules don't know yet. We wanted a detector that adapts to context, processes documents efficiently, and stays accurate. Here's how our engine works.

1. Smart Chunking, Efficient Processing

Document chunking – We break large documents into manageable chunks (2000 characters each) to ensure efficient processing without losing context.
Label-aware detection – Our API accepts both text and a list of PII label types you want to detect, making the system highly customizable.

Because each chunk has a bounded size, a longer document simply means more chunks, so the same pipeline can handle documents of any size while maintaining high accuracy.
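To make the chunking step concrete, here is a minimal Python sketch. It assumes fixed-size, non-overlapping chunks of 2000 characters; the `Chunk` type, `chunk_text`, `build_request`, and the `text`/`labels` field names are hypothetical names chosen for illustration, not the engine's actual API.

```python
from dataclasses import dataclass

CHUNK_SIZE = 2000  # characters per chunk, as described above


@dataclass
class Chunk:
    start: int  # offset of the chunk's first character in the original document
    text: str


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[Chunk]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [Chunk(start=i, text=text[i:i + size]) for i in range(0, len(text), size)]


def build_request(chunk: Chunk, labels: list[str]) -> dict:
    """Hypothetical request body: the chunk text plus the labels to detect."""
    return {"text": chunk.text, "labels": labels}
```

Calling `chunk_text` on a 4500-character document yields three chunks starting at offsets 0, 2000, and 4000. Each one would then be sent with the caller's label list, for example `["email", "patient_id"]`.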

2. What "Context" Really Means

Humans spot sensitive information because of surrounding words—"John's social security number is 123-45-6789". The numeric pattern alone could be anything; the phrase "social security number" provides the critical context. Our system analyzes both the text patterns and their surrounding context to accurately identify PII.
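The idea can be illustrated with a deliberately simple Python sketch. This is a toy stand-in, not how the engine itself works: it pairs a numeric pattern with a check for cue words in the text just before the match. The cue list, the 40-character window, and the function name are all assumptions made for the example.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_CUES = ("social security", "ssn")  # hypothetical cue list
WINDOW = 40  # characters of left context to inspect; an arbitrary choice


def find_ssn_candidates(text: str) -> list[dict]:
    """Return every NNN-NN-NNNN match, flagged by whether a cue precedes it."""
    results = []
    for match in SSN_PATTERN.finditer(text):
        left = text[max(0, match.start() - WINDOW):match.start()].lower()
        results.append({
            "text": match.group(),
            "start": match.start(),
            "end": match.end(),
            "has_context": any(cue in left for cue in CONTEXT_CUES),
        })
    return results
```

With this sketch, the number in "John's social security number is 123-45-6789" is flagged with supporting context, while the same digits in "Order 123-45-6789 shipped" match the pattern but carry no cue.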

The power lies in customization: you can specify exactly which types of PII you want to detect, from standard categories like emails and phone numbers to domain-specific identifiers like patient IDs or employee numbers.

3. Speed Optimizations under the Hood

Parallel chunk processing – We process document chunks in parallel to minimize overall processing time.
Offset preservation – When chunking documents, we carefully preserve the original character offsets so redactions appear in the correct positions.
Efficient API integration – Our streamlined API communication minimizes latency and overhead.

You experience these optimizations as documents that process in seconds rather than minutes.
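The first two points above can be sketched together in Python. This is a minimal illustration under stated assumptions: `detect_emails` is a regex stand-in for the real detector (which would be an API call), and the function names, the thread pool, and the worker count are choices made for the example rather than details of the engine.

```python
import re
from concurrent.futures import ThreadPoolExecutor

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def chunk_with_offsets(text, size=2000):
    """Return (offset, chunk) pairs so each chunk remembers where it began."""
    return [(i, text[i:i + size]) for i in range(0, len(text), size)]


def detect_emails(chunk):
    """Stand-in detector: returns spans relative to the start of the chunk."""
    return [{"label": "email", "start": m.start(), "end": m.end()}
            for m in EMAIL.finditer(chunk)]


def detect_document(text, detect=detect_emails, size=2000, max_workers=4):
    """Run `detect` on all chunks in parallel and return document-level spans."""
    chunks = chunk_with_offsets(text, size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks
        per_chunk = list(pool.map(lambda c: detect(c[1]), chunks))
    spans = []
    for (offset, _), found in zip(chunks, per_chunk):
        for span in found:
            # shift chunk-local positions back into document coordinates
            spans.append({**span,
                          "start": span["start"] + offset,
                          "end": span["end"] + offset})
    return spans
```

One limitation of this sketch: an entity that straddles a chunk boundary is split across two chunks and would be missed. Overlapping chunks are one common way to handle that, though the sketch does not attempt it.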

4. Detection Examples

| Text Example | Category Detected | Confidence | Action |
| :---------------------------------------------- | :---------------- | :--------: | :------------ |
| john.doe@example.com | email | 0.98 | ⬛ redact |
| 123-45-6789 | ssn | 0.99 | ⬛ redact |
| "Patient #A12345" | patient_id | 0.95 | ⬛ redact |
| "DOB: 01/15/1980" | date_of_birth | 0.97 | ⬛ redact |

Our system combines high precision with excellent recall, ensuring sensitive information is properly identified.
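As a rough illustration of how detections like the ones in the table might be turned into redactions, here is a short Python sketch. The detection dictionary shape (`start`, `end`, `label`, `confidence`), the 0.9 threshold, and the `redact` function are assumptions for the example, not the engine's actual output format or defaults. It also assumes spans do not overlap.

```python
def redact(text, detections, threshold=0.9, mask="⬛"):
    """Mask every detected span whose confidence meets the threshold."""
    kept = [d for d in detections if d["confidence"] >= threshold]
    # Work from the end of the text backwards so earlier offsets stay valid.
    for d in sorted(kept, key=lambda d: d["start"], reverse=True):
        length = d["end"] - d["start"]
        text = text[:d["start"]] + mask * length + text[d["end"]:]
    return text
```

Replacing each character with one mask character keeps the text length unchanged, so any remaining offsets still line up with the original document.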

5. Why We Built Our Own Solution

Flexibility – Need to detect new types of PII? Simply add the label to your API request.
Transparency – Clear confidence scores help you understand why text was flagged.
Cost efficiency – Our optimized processing keeps computational requirements low.

6. What's Next

Domain-specific models – We're developing specialized models for healthcare, finance, and legal documents.
Browser-based processing – Soon, simple documents will be processed entirely in your browser for maximum privacy.
Continuous improvement – User feedback helps us refine detection accuracy for edge cases.

Curious to see it in action? Try the live demo and watch an ordinary PDF turn squeaky-clean in under ten seconds.