| 6 min | antonio maiolo
// ai, privacy, gdpr, architecture, security

Your AI Reads Customer Data — A Proxy Solved My GDPR Problem

A client needed Claude Opus or GPT-5.2. Not “something comparable” — those specific models, for customer-facing AI features that couldn’t ship with anything less. One problem: neither is available hosted in the EU. GPT-5.2 doesn’t exist in any Azure EU Data Zone. Anthropic stores data in the US by default.

That’s how I ended up co-building Noirdoc — a proxy that strips PII before it reaches any model, so it doesn’t matter where the model is hosted. Here’s the decision framework behind it.

The Model You Want Doesn’t Live Here

The frontier models are in the US. OpenAI launched EU data residency in early 2025, but it’s limited — only new API projects, not all endpoints, and inference residency is still constrained. Azure OpenAI has Data Zone deployments in Sweden Central, but GPT-5.2? Not there. No timeline either. Anthropic offers EU routing through Bedrock and Vertex, but their default API stores data stateside.

So you’re stuck. The model your client needs runs on US infrastructure. Every API call with customer data — names, emails, IBANs, complaint details — is a GDPR Article 44 data transfer to a third country. The Italian regulator banned ChatGPT for a month in 2023 over exactly this kind of thing, then followed up with a €15M fine in December 2024.

You can sign DPAs all day. That covers the provider’s obligations. It doesn’t change the fact that you sent raw PII across the Atlantic.

Four Approaches, Three Dead Ends

| Approach | Data leaves network? | Model quality | Integration effort | Production-ready? |
| --- | --- | --- | --- | --- |
| Hope and DPA | Yes (raw PII) | Frontier | None | Legally risky |
| On-premises models | No | Open-source | Very high (infra) | Months |
| Fine-tune on sanitized data | No (at inference) | Degraded | High (pipeline) | Fragile |
| Pseudonymization proxy | Yes (no PII) | Frontier | Minimal (URL swap) | Days |

Hope and DPA is what most companies do. Sign the processing agreement, move on. Works until a regulator looks closely — and they’re looking more closely every year.

On-premises models means running Llama or Mistral on your own GPUs. No data leaves. But you need ML engineers, a GPU fleet, and you’re shipping a worse product. If the client specifically needs GPT-5.2 reasoning quality, “run Llama instead” is not an answer.

Fine-tuning on sanitized data sounds clean until you build the sanitization pipeline. German compound names, multi-language content, addresses in free text — edge cases everywhere. The pipeline breaks, models drift, and you still need to sanitize every inference request.

A pseudonymization proxy strips PII before the request hits the API. The AI sees <<PERSON_1>> complained about invoice <<ID_1>> instead of real data. Response comes back with placeholders, proxy restores the originals. You use whatever model you want. The data that reaches the provider is clean.
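In miniature, the round trip looks like this. The function names and the mapping are illustrative, not Noirdoc's actual API:

```python
def pseudonymize(text: str, mapping: dict[str, str]) -> str:
    """Replace known PII strings with stable placeholders before the API call."""
    for original, placeholder in mapping.items():
        text = text.replace(original, placeholder)
    return text

def restore(text: str, mapping: dict[str, str]) -> str:
    """Swap placeholders in the model's response back to the originals."""
    for original, placeholder in mapping.items():
        text = text.replace(placeholder, original)
    return text

# Hypothetical session mapping built by the detection layers
mapping = {"Hans Müller": "<<PERSON_1>>", "RE-2024-881": "<<ID_1>>"}

outbound = pseudonymize("Hans Müller complained about invoice RE-2024-881.", mapping)
# The provider only ever sees: "<<PERSON_1>> complained about invoice <<ID_1>>."

inbound = restore("Apologize to <<PERSON_1>> regarding <<ID_1>>.", mapping)
# The client gets back: "Apologize to Hans Müller regarding RE-2024-881."
```

The real work is building the mapping, which is what the rest of this post is about.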

How Pseudonymization Actually Works

The integration is a URL swap:

from openai import OpenAI

client = OpenAI(
    api_key="your-noirdoc-key",
    base_url="https://api.noirdoc.de/v1"
)

No SDK changes. No code refactoring. Your existing OpenAI, Anthropic, or Azure integration stays identical — the proxy sits in front and handles detection, replacement, and restoration transparently.

Two detection layers run in parallel: rule-based patterns catch structured PII (IBANs, emails, phone numbers), while context-sensitive NER catches names in free text. The regex gets hans.mueller@company.de. The NER gets “please forward this to Schmidt” — no email, no structured field, just a surname in a sentence.
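A sketch of the two layers, with illustrative patterns. The NER layer is stubbed with a hardcoded prediction here; a real one would call a multilingual NER model:

```python
import re

# Rule-based layer: structured PII with predictable shapes (illustrative patterns)
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{2,4}){3,8}\b"),
    "PHONE": re.compile(r"\+\d{1,3}[\d /-]{6,}\d"),
}

def rule_based_detect(text: str) -> list[tuple[str, str]]:
    """First layer: regex patterns for structured PII."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

def ner_detect(text: str) -> list[tuple[str, str]]:
    """Second layer: context-sensitive NER. Stubbed for illustration —
    this stands in for a model that flags bare surnames in free text."""
    hits = []
    for word in text.split():
        token = word.strip(".,;!?")
        if token == "Schmidt":  # stand-in for an actual model prediction
            hits.append(("PERSON", token))
    return hits

text = "Forward hans.mueller@company.de to Schmidt, IBAN DE44 5001 0517 5407 3249 31."
detections = rule_based_detect(text) + ner_detect(text)
```

The regex layer is cheap and deterministic; the NER layer catches what has no fixed shape. Neither alone is enough.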

Session state stays consistent. <<PERSON_1>> always references the same person across messages and tool calls within a conversation. The AI reasons about entities correctly — it just never sees who they actually are.
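Keeping placeholders stable comes down to a per-session entity map. A minimal sketch, not Noirdoc's internals:

```python
class SessionMap:
    """Hands out placeholders that stay stable for the life of a conversation."""

    def __init__(self):
        self._by_value: dict[str, str] = {}
        self._counters: dict[str, int] = {}

    def placeholder(self, label: str, value: str) -> str:
        # Same value → same placeholder, across every message and tool call.
        if value not in self._by_value:
            self._counters[label] = self._counters.get(label, 0) + 1
            self._by_value[value] = f"<<{label}_{self._counters[label]}>>"
        return self._by_value[value]

session = SessionMap()
session.placeholder("PERSON", "Hans Müller")   # <<PERSON_1>>
session.placeholder("PERSON", "Anna Schmidt")  # <<PERSON_2>>
session.placeholder("PERSON", "Hans Müller")   # <<PERSON_1>> again — stable
```

Without this, "PERSON_1 in message three" and "PERSON_1 in message five" could be different people, and the AI's reasoning silently breaks.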

The Gotchas Nobody Warns You About

False positives are worse than false negatives. Over-aggressive detection breaks prompts. Replace “Berlin” with <<LOCATION_1>> and the AI’s geographic reasoning collapses. Replace “März” because it looks like a surname and your date handling falls apart. I spent more time on false positive reduction than on the proxy itself.
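One of the simpler levers is an allowlist of words that look like PII but must never be replaced. A toy version, with illustrative German entries:

```python
# Words a name detector will flag but must leave alone:
# cities, month names, and similar context the model needs intact.
ALLOWLIST = {"Berlin", "Hamburg", "München", "März", "Mai"}

def filter_detections(hits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Drop candidate PII on the allowlist, keeping geographic
    and temporal reasoning intact."""
    return [(label, value) for label, value in hits if value not in ALLOWLIST]

hits = [("PERSON", "Müller"), ("PERSON", "März"), ("LOCATION", "Berlin")]
filter_detections(hits)  # [("PERSON", "Müller")]
```

A static list only goes so far; ambiguous cases like "August" (month and given name) need the surrounding context, which is where most of the real effort went.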

Structured data fights back. JSON payloads, function calling arguments, structured outputs — the proxy needs to understand document structure, not just scan for patterns. Early versions mangled JSON keys that happened to look like names. A field called "martin_score" is not PII.
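The fix is to walk the structure and only ever touch string values, never keys. A sketch of that traversal:

```python
def pseudonymize_json(node, replace):
    """Walk a JSON-like structure, replacing only string *values*.
    Keys such as "martin_score" are structural, never PII."""
    if isinstance(node, dict):
        return {key: pseudonymize_json(value, replace) for key, value in node.items()}
    if isinstance(node, list):
        return [pseudonymize_json(item, replace) for item in node]
    if isinstance(node, str):
        return replace(node)
    return node

payload = {"martin_score": 0.9, "customer": "Hans Müller"}
redact = lambda s: "<<PERSON_1>>" if s == "Hans Müller" else s
pseudonymize_json(payload, redact)
# {"martin_score": 0.9, "customer": "<<PERSON_1>>"}
```

The same rule applies to function-calling arguments and structured outputs: parse first, then detect, instead of pattern-scanning the serialized blob.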

German names are a special kind of pain. Compound surnames (Müller-Schmidt), umlauts, titles embedded in names (Dr. med. Hans-Peter von Braun). English-trained NER models miss half of these. My first pass caught 60% of German names. That’s a compliance failure, not a feature.
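To make the shape of the problem concrete, here is an illustrative regex for exactly the name forms an English-trained NER trips over. It is a sketch of the pattern space, not a production detector:

```python
import re

# German name shapes: stacked academic titles, hyphenated given names,
# nobility particles, compound surnames.
GERMAN_NAME = re.compile(
    r"(?:(?:Dr\.|Prof\.|med\.)\s+)*"                    # stacked titles: "Dr. med."
    r"(?:[A-ZÄÖÜ][a-zäöüß]+-)?[A-ZÄÖÜ][a-zäöüß]+\s+"    # given name, possibly hyphenated
    r"(?:von\s+|zu\s+)?"                                # nobility particle
    r"[A-ZÄÖÜ][a-zäöüß]+(?:-[A-ZÄÖÜ][a-zäöüß]+)?"       # surname, possibly compound
)

match = GERMAN_NAME.search("Bitte an Dr. med. Hans-Peter von Braun weiterleiten.")
match.group()  # 'Dr. med. Hans-Peter von Braun'
```

Even this toy pattern needs umlaut-aware character classes and four optional components, and it still says nothing about bare surnames like "Müller-Schmidt" standing alone in a sentence. That part genuinely needs a model trained on German text.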

Audit trails matter more than detection rates. Regulators don’t ask “what’s your detection accuracy?” They ask “show me what was sent to OpenAI for this customer’s data on this date.” Every request needs logging — what was detected, what was replaced, what the AI saw. The boring part that makes or breaks compliance.
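What such a log entry can look like, as a sketch (the schema is illustrative; note that raw values are deliberately absent):

```python
import json
from datetime import datetime, timezone

def audit_record(request_id: str, detections, outbound_text: str) -> str:
    """One log line per request: what was detected, what replaced it,
    and exactly what the provider saw."""
    return json.dumps({
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "detections": [
            {"label": label, "placeholder": placeholder}  # never log the raw value
            for label, placeholder in detections
        ],
        "sent_to_provider": outbound_text,
    })

line = audit_record(
    "req-4711",
    [("PERSON", "<<PERSON_1>>"), ("ID", "<<ID_1>>")],
    "<<PERSON_1>> complained about invoice <<ID_1>>.",
)
```

With a trail like this, "show me what was sent to OpenAI on this date" is a query, not a forensic investigation.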

The On-Prem Question

Running your own models is the most credible alternative. No data leaves your infrastructure — not even pseudonymized. If your threat model demands that, it’s the right call.

But be honest about the cost. You’re looking at a GPU fleet, an ML ops team, months to production, and open-source model quality that doesn’t match frontier. When a client says “I need Claude Opus for this,” telling them to wait six months for an on-prem Llama deployment isn’t a serious answer. The proxy lets you ship this quarter with the model they actually asked for. Noirdoc runs as a managed service in Germany or self-hosted on your own infrastructure — pick whichever matches your risk appetite.

When to Use a Proxy (and When to Run Your Own)

| Use a pseudonymization proxy if | Run on-premises models if |
| --- | --- |
| You need a specific frontier model | Regulations prohibit any external data transfer |
| Your team doesn't have ML infrastructure | You have an existing GPU fleet and ML ops team |
| You need to ship compliance this quarter | Model quality trade-off is acceptable |
| Your data includes European PII | You need zero data exposure, even pseudonymized |

The right compliance approach is the one that ships. For most companies using AI with EU customer data, that’s a proxy — not a six-month infrastructure project.


Noirdoc is drop-in pseudonymization for OpenAI, Anthropic, and Azure. Managed in Germany or self-hosted.


