April 24, 2026 · 8 min read
Three-Layer Threat Detection Without a Subscription Feed
How heuristic, lexical, and semantic detection layers catch threats that aren't on any blocklist yet — using bundled threat banks built from public sources, not a commercial feed contract.
A reputation-based blocklist tells you whether a domain you're querying has been flagged as malicious by someone, somewhere, at some point in the past. It's a useful answer, but it's the wrong question.
The question that actually matters at threat-detection time is: given everything I can compute about this domain right now, how suspicious is it? That answer doesn't depend on whether the domain is on a list. It depends on what the domain looks like, what it's clustered with, who else has queried it, when it was registered, and what patterns of activity it's showing.
This post covers how Paloryx Resolver's three-layer detection actually works — and why it doesn't require a commercial threat-intel subscription to operate.
The blocklist gap
Reputation-based blocking catches the cataloged bulk of malicious infrastructure. Curated public threat-intelligence sources cover hundreds of thousands of known-bad domains, refresh continuously, and stop the routine majority of threat traffic instantly. There's no reason not to use them.
But reputation lists have an inherent latency. A new phishing campaign registers a batch of look-alike domains, sends millions of emails, harvests credentials, and walks away — often before any list has cataloged the domains involved. Modern attack campaigns operate explicitly inside this window. If your only detection layer is a list, the window is your blind spot.
Three additional detection layers close most of that gap.
Layer 1: Heuristic signals
Some malicious domains have signatures that are computable from the domain name and its DNS behavior alone, without needing to have seen the domain before.
DGA (domain generation algorithm) detection. Malware uses DGAs to generate hundreds or thousands of candidate domains per day, register a small subset, and use them for command-and-control rendezvous. The generated domains have detectable signatures: high entropy, character distributions that don't match natural-language norms, n-gram frequencies that don't match the dictionary. A trigram-frequency model trained on a top-domain reference corpus is enough to flag DGA candidates with high confidence. xkfjwpqmvtsldh.biz lights up. marketing-newsletter-2026.biz doesn't.
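A minimal sketch of the trigram signal described above. The reference word list here is a tiny stand-in; a real model would be trained on a top-domain corpus, and the threshold would be tuned against labeled data.

```python
# Tiny stand-in corpus; a production model would be trained on a
# top-domain reference list. Word set is illustrative only.
REFERENCE_WORDS = [
    "marketing", "newsletter", "google", "amazon", "secure", "account",
    "login", "update", "service", "mail", "cloud", "shop", "news",
    "media", "online", "support", "payment",
]

def trigrams(s: str) -> set[str]:
    return {s[i:i + 3] for i in range(len(s) - 2)}

KNOWN_TRIGRAMS = set().union(*(trigrams(w) for w in REFERENCE_WORDS))

def trigram_score(label: str) -> float:
    """Fraction of the label's letter trigrams seen in the reference
    corpus. DGA output scores near 0; dictionary-like labels near 1."""
    alpha = "".join(ch for ch in label.lower() if ch.isalpha())
    grams = trigrams(alpha)
    if not grams:
        return 1.0  # too short to judge; don't flag
    return sum(g in KNOWN_TRIGRAMS for g in grams) / len(grams)

print(trigram_score("xkfjwpqmvtsldh"))            # near 0: DGA-shaped
print(trigram_score("marketing-newsletter-2026")) # near 1: natural language
```

In practice this score would be fused with character entropy and label length rather than used alone.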
Fast-flux infrastructure. Legitimate domains tend to point at a small set of stable IP addresses. Malicious infrastructure rotates IPs aggressively to evade IP-level blocking. A domain whose answers cycle through a fresh IP every minute, where the IPs come from many different ASNs across many geographies, has a fast-flux fingerprint that distinguishes it from CDN load-balancing.
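One way to capture that fingerprint numerically, as a sketch: combine IP churn (how often answers bring a fresh IP) with ASN spread (how scattered those IPs are across networks). The formula and thresholds are illustrative, not Paloryx's production tuning.

```python
def fast_flux_score(answers: list[tuple[float, str, int]]) -> float:
    """answers: (timestamp, ip, asn) tuples observed for one domain.
    CDNs rotate IPs too, but usually within one or two ASNs; fast-flux
    infrastructure spreads fresh IPs across many unrelated networks.
    Scoring formula is illustrative."""
    if len(answers) < 5:
        return 0.0  # not enough observations to judge
    ips = {ip for _, ip, _ in answers}
    asns = {asn for _, _, asn in answers}
    ip_churn = len(ips) / len(answers)  # fraction of answers with a fresh IP
    asn_spread = len(asns) / len(ips)   # how scattered the IPs are across ASNs
    return ip_churn * asn_spread        # 1.0 = every answer a new IP in a new ASN

# CDN-like pattern: a handful of IPs, all in one ASN.
cdn = [(t, f"198.51.100.{t % 4}", 64500) for t in range(10)]
# Flux-like pattern: a fresh IP from a different ASN on every answer.
flux = [(t, f"203.0.113.{t}", 64500 + t) for t in range(10)]
```

Geographic dispersion of the IPs would be a third factor in a fuller version.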
Newly-registered domain (NRD) awareness. A large share of domains registered in any given week are either short-lived test registrations or are put to work immediately for phishing, fraud, or malware delivery. Combining NRD age with other signals (lexical patterns, query volume, hosting characteristics) provides a strong leading indicator without flagging every legitimate new business as suspicious.
Subdomain enumeration patterns. Some malware reconnaissance involves enumerating possible subdomains of a target organization. A device making rapid-fire queries for dev.target.example, test.target.example, staging.target.example, etc. shows a pattern that's almost never legitimate end-user behavior.
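A sliding-window detector for this pattern can be sketched in a few lines. The window size and threshold below are illustrative values, not production tuning.

```python
from collections import defaultdict, deque

class EnumerationDetector:
    """Flags a device querying many distinct subdomains of one parent
    domain inside a short window. Window and threshold values are
    illustrative."""

    def __init__(self, window_s: float = 60.0, threshold: int = 20):
        self.window_s = window_s
        self.threshold = threshold
        self._seen: dict[tuple[str, str], deque] = defaultdict(deque)

    def observe(self, ts: float, device: str, qname: str) -> bool:
        parts = qname.rstrip(".").split(".")
        if len(parts) < 3:
            return False  # bare domain, nothing being enumerated
        parent = ".".join(parts[-2:])
        q = self._seen[(device, parent)]
        q.append((ts, parts[0]))
        while q and ts - q[0][0] > self.window_s:
            q.popleft()  # slide the window forward
        distinct = len({label for _, label in q})
        return distinct >= self.threshold
```

A production version would also treat the effective TLD correctly (e.g. `.co.uk`) via a public-suffix list rather than the naive last-two-labels split used here.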
These signals are cheap to compute, run on the resolver itself, and require zero external lookups.
Layer 2: Lexical similarity
Phishing campaigns lean heavily on visual confusion. The user sees paypa1.com in the address bar and the brain auto-corrects to paypal.com. The actual difference is one character — a 1 substituted for an l — but the cognitive system doesn't catch it in the moment.
Standard edit-distance algorithms (Levenshtein, Damerau-Levenshtein) treat all character substitutions equally. That's wrong for this use case. The substitution l → 1 is a high-confidence phishing signal; the substitution l → q is just a typo. Confusable-character edit distance — where the substitution cost reflects visual similarity — catches the homoglyph attacks that standard edit distance misses.
Common confusables and their typical phishing usage:
| Looks like | Actually is | Phish example |
|---|---|---|
| o | 0 | g00gle.com |
| l | 1 | paypa1.com |
| i | 1 | m1crosoft.com |
| m | rn | arnazon.com |
| o | о (Cyrillic, U+043E) | paypal.com (mixed-script) |
| a | α (Greek alpha) | IDN homoglyph attacks |
Paloryx Resolver's lexical layer uses Damerau-Levenshtein with confusable-cost overlays, scored against a curated brand corpus (~500 high-value brands across banks, payment networks, big tech, government, e-commerce, and the rest of the standard phishing-target list). A query within edit-distance ≤1.0 to a brand fires the impersonation signal.
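A minimal sketch of the idea, assuming a small hand-picked cost table (the real overlay would cover the full Unicode confusables set, and multi-character confusables like rn → m need a separate normalization pass this sketch omits):

```python
# Costs are illustrative; a production table would cover the full
# Unicode confusables set.
CONFUSABLE_COST = {
    ("l", "1"): 0.1, ("1", "l"): 0.1,
    ("o", "0"): 0.1, ("0", "o"): 0.1,
    ("i", "1"): 0.2, ("1", "i"): 0.2,
}

def _sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return CONFUSABLE_COST.get((a, b), 1.0)

def confusable_distance(s: str, t: str) -> float:
    """Damerau-Levenshtein (optimal string alignment variant) where
    visually confusable substitutions cost a fraction of a full edit."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                               # deletion
                d[i][j - 1] + 1.0,                               # insertion
                d[i - 1][j - 1] + _sub_cost(s[i - 1], t[j - 1]), # substitution
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)    # transposition
    return d[m][n]

print(confusable_distance("paypa1", "paypal"))  # 0.1 — homoglyph, fires
print(confusable_distance("paypaq", "paypal"))  # 1.0 — ordinary typo
```

This is why the distance can be fractional: paypa1 sits at 0.1 from paypal and fires the impersonation signal with far more confidence than an arbitrary one-character typo.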
The brand corpus is curated, not feed-purchased. The list of 500 brands attackers consistently impersonate doesn't change daily. This is a one-time data construction problem, not a subscription.
Layer 3: Semantic similarity
Some impersonation patterns aren't lexically close to any specific brand. A domain like verify-banking-account-secure.example doesn't typo-match chase.com or bofa.com — but its word distribution is heavily clustered around banking and authentication patterns that legitimate bank domains rarely use. A trained embedding model can recognize this clustering.
The semantic layer uses a pre-trained sentence-transformer model (~33MB) that produces 384-dimensional embeddings of text. The model itself is pre-trained and frozen — we don't fine-tune it. Instead, the threat-detection capability comes from constructing high-quality reference banks of indicator embeddings:
- Anchor bank. Hundreds of hand-curated brand identifiers across two dozen categories (banks, payment networks, crypto exchanges, big tech, government, telecom, streaming, travel, food delivery, healthcare, AI services, registrar/hosting, security, plus the generic phishing-keyword set). Parent-walk resolution so subdomain-shaped impersonations resolve to their parent brand correctly.
- Malicious bank. Tens of thousands of indicators ingested daily from curated public threat-intelligence sources. Each entry encoded once at curation time; the resulting vector lives in the bank.
- Legit bank. ~1M indicators from a curated top-domain reference corpus. Provides a baseline for "looks like a normal-frequency legitimate domain" so we can distinguish "looks like nothing on the legit list either" (out-of-distribution / OOD) from "looks like a specific malicious cluster."
At query time, the resolver embeds the incoming query, computes cosine similarity against the three banks, and produces three signals: anchor similarity (impersonation), malicious similarity (known-bad shape), and OOD distance from the legit bank (novel suspicion).
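With pre-normalized banks, that query-time computation is just three matrix-vector products. A sketch (function names and the max-similarity reduction are illustrative; only the 384-dimensional embedding size comes from the description above):

```python
import numpy as np

def score_query(query_vec: np.ndarray,
                anchor_bank: np.ndarray,
                malicious_bank: np.ndarray,
                legit_bank: np.ndarray) -> tuple[float, float, float]:
    """Cosine-similarity scoring of one query embedding against three
    reference banks. Banks are (N, D) matrices of unit-normalized
    embeddings (D = 384 for the model described above), so cosine
    similarity reduces to one matrix-vector product per bank."""
    q = query_vec / np.linalg.norm(query_vec)
    anchor_sim = float(np.max(anchor_bank @ q))        # impersonation
    malicious_sim = float(np.max(malicious_bank @ q))  # known-bad shape
    ood = 1.0 - float(np.max(legit_bank @ q))          # novel suspicion
    return anchor_sim, malicious_sim, ood
```

The expensive step — encoding indicators into vectors — happens once, offline, at curation time; the resolver only ever pays for the dot products.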
The scoring fuses these into the same 0–100 scale as the other layers. The output is auditable: which bank matched, what the cosine similarity was, what brand or known-bad host was the closest match. Compliance officers and security analysts can trace any decision back to its supporting evidence.
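A fusion step of that shape might look like the following sketch. The weights are purely illustrative, not Paloryx's production tuning:

```python
def fuse_semantic(anchor_sim: float, malicious_sim: float, ood: float,
                  weights: tuple[float, float, float] = (45.0, 35.0, 20.0)) -> float:
    """Fold the three semantic signals into the shared 0-100 scale.
    Weights are illustrative. Each signal is clamped to [0, 1] first so
    no single input can push the score out of range."""
    clamp = lambda x: max(0.0, min(1.0, x))
    wa, wm, wo = weights
    return round(wa * clamp(anchor_sim) + wm * clamp(malicious_sim) + wo * clamp(ood), 1)

print(fuse_semantic(0.9, 0.2, 1.0))  # 67.5
```

Keeping the fusion this simple is part of what makes the output auditable: each weighted term maps directly to a named signal an analyst can inspect.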
Behavioral: C2 beacon detection
The fourth layer isn't a per-query score — it's a periodic background scan looking for behavioral patterns across the query stream. Command-and-control malware that's already on a device makes periodic callouts to receive instructions. The timing of those callouts is detectable.
A device making queries to a rare destination at 57-second intervals for eight straight hours, regardless of what the destination domain looks like, has a beacon fingerprint. Hourly background analysis identifies these patterns, scores them, and surfaces them in the dashboard with plain-language explanations: timing pattern, destination characteristics, device identity, suggested next action.
This catches threats that have already infected a device — the kind that any pre-connection blocklist or score is too late to stop. It's a fundamentally different class of detection, complementary to the per-query layers.
Why no commercial feed subscription
Most "AI-driven threat detection" products in this category depend on a commercial threat-intelligence subscription. The major commercial threat-intel vendors provide curated IOC feeds that the detection product consumes at runtime. The commercial value is real — those vendors invest heavily in primary research, fast-tracked indicator publication, and confidence labeling.
The cost is real too. Commercial feeds typically run high five figures to mid six figures per year and come with usage restrictions, redistribution prohibitions, and dependency relationships. For SMB or mid-market organizations, per-customer feed cost is often a non-starter.
The architecture we chose makes commercial feeds optional rather than required. The three-layer detection stack derives its capability from:
- Curated public threat-intelligence sources under licenses we've reviewed for commercial-product compatibility.
- Curated reference banks built from those sources via an offline pipeline. The customer never receives the raw feed; they receive numerical embedding vectors derived from it.
- A pre-trained open-source embedding model under a permissive license.
- Hand-curated brand corpus for the lexical and semantic anchor banks — built in-house.
None of these inputs require a per-customer subscription. The total ongoing cost is whatever it takes to maintain the offline curation pipeline — a few cloud VMs and engineering attention, not a per-customer license.
(For qualified buyers conducting vendor due diligence: the specific source list, our license-review notes, and the curation pipeline architecture are available under NDA.)
What that means for the customer: predictable pricing, no surprise renewal increases, no upstream-license risk, and no fourth-party data processor in your supply-chain due-diligence diagram.
What this isn't
This stack catches a substantial portion of threats that pure blocklists miss. It does not replace endpoint detection and response, network segmentation, identity controls, or any other layer of your posture. It catches a specific class of threats — DNS-resolvable ones, with detectable name-shape or behavioral signatures — extremely well. Other classes need other tools.
It's also not a claim of a magic detection rate. We're working on a measurement harness (an internal "bake-off rig") to publish hard numbers; until then, every detection-rate claim should be read with the caveat that we're estimating rather than measuring.
What we can say with confidence: the architecture is honest. Detection runs in your binary. Threat intel ships in the installer. Paloryx Labs never sees, logs, or stores your DNS queries — detection happens entirely on your hardware. For the buyers we're serving, that combination matters more than a marginal improvement in caught-threat percentage from a flashier model.
Published by Paloryx Labs