sector/geo inference for enrich command

This commit is contained in:
Emil Lerch 2026-05-30 11:21:40 -07:00
parent d9fc17cf15
commit 85c9a48969
4 changed files with 910 additions and 11 deletions

View file

@ -32,7 +32,7 @@ repos:
- id: test
name: Run zig build test
entry: zig
args: ["build", "coverage", "-Dcoverage-threshold=72"]
args: ["build", "coverage", "-Dcoverage-threshold=74"]
language: system
types: [file]
pass_filenames: false

300
TODO.md
View file

@ -358,6 +358,42 @@ gain. Possible fixes are discussed in the "Contributions diff" TODO
below — option C there (per-account `cash_is_contribution`) would
make manually-entered ESPP-style cash additions count correctly.
## Audit: stale manual prices section is incorrect — priority HIGH
The `Stale manual prices` section in `zfin audit` (in
`src/commands/audit.zig` around line 1333) isn't computing the
right thing. The current logic walks `portfolio.lots`, filters to
lots with both `price` and `price_date` set, and flags any whose
`as_of.days - price_date.days > stale_days`. In practice this
either over-flags (counting lots that aren't really
manually-priced), under-flags (missing lots that ARE manually
priced but lack `price_date`), or both — needs investigation
against a real portfolio to determine which.
Things to check:
- Are we using the right field to identify "manually priced"? The
`Lot.price` field is set for any non-API price (manual override,
illiquid valuation, CD face, etc.); some of those shouldn't be
in a "stale prices" check (e.g. CDs with a fixed face value
aren't stale by age).
- Should the staleness comparison use `Allocation.is_manual_price`
(computed at the position level after the price-resolution
cascade) instead of the per-lot field? That captures "the price
this position is currently displaying came from a manual
source," which is what the user actually cares about.
- `price_date` falsely-null lots: if a lot has `price` set but no
`price_date`, we silently skip it instead of flagging it. That's
almost certainly wrong — a manually-priced lot with no recorded
date is the *most* stale case, not the least.
- Per-symbol vs per-lot: if the same symbol appears in multiple
lots with the same manual price, we currently emit one line per
lot. Probably wants to be one line per symbol with a count, or
at least dedup by `(symbol, price, price_date)`.
Fix should land with regression tests against a fixture portfolio
that exhibits each of the above shapes.
## Investigate: detailed 401(k) contributions data source
Found a more detailed contributions screen on at least one
@ -589,6 +625,40 @@ Verification: open the TUI, press `d` on projections, try to type a
date with the keypad. Then try the keyboard row. Both should commit
identical input.
## TUI: memory leaks somewhere — priority MEDIUM
User reported leaks while doing a detailed TUI walkthrough; no
specific tab or interaction yet identified. The TUI uses a mix
of arena allocators (frame-scoped) and persistent tab state, so
likely culprits:
- Per-tab `State` structs that hold `[]const u8` slices duped
from a long-lived allocator but not freed when the tab
reloads or the symbol changes.
- Cached service-fetch results stored in tab state that aren't
`result.deinit()`-ed before being replaced.
- ArrayList accumulators that get appended to across multiple
draw cycles without an intervening clear.
- Vaxis event/dialog closures that capture strings into
arena-allocated lambdas but escape the arena's lifetime.
### Investigation plan
1. Run the TUI under `std.testing.allocator` (a debug
allocator that panics on leak). The current binary uses a
gpa, which silently tolerates leaks. A test-mode TUI run
with a leak-detecting allocator would surface the
offending alloc sites with file/line info.
2. Walk each tab's `State.deinit` (and `tab.deactivate` /
`tab.reload` hooks) against the `State` field list — every
owned field needs a free path on every state-change boundary.
3. Pay specific attention to `classification_map` and any
per-symbol caches (option chains, candle snapshots) — those
are the biggest fixed-size strings.
No reproducer yet. When the user has a more specific lead
(which tab, which interaction), this entry should narrow.
## CLI dispatch / arg-parsing bugs (found May 2026)
Found during a post-framework-refactor sanity check of all 20
@ -627,6 +697,236 @@ Investigate by replacing the `print` with a manual `print` +
`flush` to see if it's a buffer-not-flushed issue, or by
serializing a known-good fixture in isolation.
## Analysis: dividend equity / income-shaped equity — think about it
Dividend-equity ETFs (SCHD, VYM, DGRO, NOBL, SDY, VIG, etc.)
bucket as Equity in `analysis.bucketSector`. That's correct for
risk-exposure analysis — they drop with the market in a
2008-style crash, regardless of the dividend stream — but it
loses the income-vs-growth distinction that retirement-planning
tools care about.
Open question: is there a useful second dimension to add?
Possibilities:
- **Yield-weighted breakdown.** Aggregate `current_yield` per
position, weight by market value, report a portfolio-level
yield. Doesn't change the asset-class taxonomy; adds a new
metric.
- **Income coverage of expenses.** "My dividends + bond coupons
cover X% of projected retirement spending." Closer to what the
income-side framing actually wants — answers the question
rather than redefining the buckets.
- **Income-equity sub-bucket within Equity.** A sub-row in the
Asset Category breakdown, not a 5th top-level bucket. Would
need a way to mark funds as "income-shaped" — probably a
per-symbol opt-in in `metadata.srf`.
Not a bug. Not blocking anything. Could end up being a feature.
This is a note to revisit after using the 4-bucket view for a
while and seeing whether the missing dimension actually matters
in practice.
Resist the temptation to:
- **Add a 5th top-level bucket** ("Income Equity" / "Dividend
Equity"). The 4-bucket view is already the right answer for
"how much equity exposure do I have?". A 5th bucket
fragments the headline number.
- **Override SCHD to Fixed Income.** Wrong on risk grounds.
SCHD will lose 35-45% in an equity crash; treating it as FI
makes the user think they have downside protection they don't.
- **Add per-symbol "intent" metadata** (`held_for_income::true`).
Smell of putting framing into data. Intent is a property of
the holder's strategy, not the security.
If a fix lands, it's probably a separate analysis section (yield
breakdown, income coverage) — not a change to the asset-class
taxonomy.
## Enrich: title-keyword classification inference for ETFs — priority MEDIUM
When Wikidata returns no entry for a fund symbol and we fall
through to the EDGAR ticker-map fallback, the auto-emitted
metadata line carries a generic `sector::Equity / Corporate,
geo::US,asset_class::Fund` triple. That's mechanically correct
(NPORT-P really does say "this fund holds equity in corporate
issuers, US-domiciled fund") but loses information the user
actually cares about: sector-themed ETFs (XLV → Healthcare),
geo-themed ETFs (FRDM → Emerging Markets, IDMO/HFXI/IVLU →
International Developed).
The fund's *title* often carries the answer unambiguously. We
already plumb `series_name` (NPORT-P `<seriesName>`, falling
back to the company-tickers `title`) through to
`emitMissingClassification`. Add a keyword-inference pass that
overrides the default sector and geo when the title contains
unambiguous keywords.
### Sector inference
Trigger: when the lookup is `.managed_fund` or `.company_or_uit`
AND the fund has a single dominant `Equity / Corporate` sector
(>95% of holdings), AND the title carries one of the keywords
below. Emit a single GICS-tagged line in place of the NPORT-P
breakdown.
Conservative keyword set (matches one GICS sector unambiguously):
- "Health Care" / "Healthcare" → Healthcare
- "Semiconductor" → Technology (not "Technology" alone — too generic)
- "Software" → Technology
- "Financial" → Financial Services (careful: "Financial Select Sector SPDR")
- "Energy" → Energy
- "Oil & Gas" / "Oil and Gas" → Energy
- "Real Estate" / "REIT" → Real Estate
- "Utilities" → Utilities
- "Consumer Discretionary" → Consumer Cyclical
- "Consumer Staples" → Consumer Defensive
- "Industrial" → Industrials (careful: "Industrial Materials" — match
whole phrase)
- "Materials" → Basic Materials
- "Communication" / "Telecom" → Communication Services
Reuse `Wikidata.canonicalizeSector`'s sector constants so the
two taxonomies don't drift.
The "single dominant Equity / Corporate" guard prevents the
inference from misclassifying multi-asset funds (FAGIX-shape:
Debt + Equity + Loan), pure-debt funds (VBTLX), or sector-fund
edge cases like a hypothetical "Vanguard Healthcare Income Fund"
(if the breakdown is multi-sleeve, leave the NPORT-P decomposition
alone).
### Geo inference
Trigger: when the lookup is `.managed_fund` or `.company_or_uit`
AND the title carries an unambiguous geo keyword. Override the
default `geo::US` to the inferred bucket.
This one is more important than sector inference because the
default is *factually wrong* for international/emerging funds,
not just imprecise. FRDM holds Taiwanese, South Korean,
Chilean, Polish equities; tagging it `geo::US` overstates US
exposure and understates EM exposure proportionally to the
fund's weight in the portfolio.
Conservative keyword set:
- "Emerging Markets" / "Emerging Market" → Emerging Markets
- "Frontier Markets" → Emerging Markets (or own bucket if added)
- "International Developed" → International Developed
- "International" / "Intl" / "Intl." → International Developed
(careful: only when not paired with "+ US" or similar mixing
modifier)
- (Skip country-specific keywords for now — "China" / "Japan" /
"Europe" are unambiguous but we'd be designing per-country
buckets that don't exist in the current taxonomy)
False-positive risk for "International": fund names like
"Vanguard Total International + US Equity Index" would mis-tag
as International. Audit your portfolio's titles before locking
in the keyword. The conservative version may need to be
"International" only when the title contains no US-related
keyword, or might need explicit phrase matching.
### Tests
- `inferSectorFromTitle("State Street(R) Health Care Select Sector SPDR(R) ETF")` → "Healthcare"
- `inferSectorFromTitle("iShares Semiconductor ETF")` → "Technology"
- `inferSectorFromTitle("Schwab U.S. Dividend Equity ETF")` → null (broad-market, no sector word)
- `inferSectorFromTitle("Vanguard Total Bond Market Index Fund")` → null
- `inferGeoFromTitle("Freedom 100 Emerging Markets ETF")` → "Emerging Markets"
- `inferGeoFromTitle("iShares MSCI Intl Value Factor ETF")` → "International Developed"
- `inferGeoFromTitle("Schwab U.S. Dividend Equity ETF")` → null
- Plus integration tests against `emitMissingClassification` confirming the override only fires when the dominant-sector / single-geo guards are satisfied.
### User's portfolio coverage
After this work, the funds in the user's metadata.srf that
currently need hand-editing for sector/geo would be auto-tagged:
- Sector: XLV (Healthcare), SOXX (Technology), QTUM (Technology — "Quantum" is borderline; might require explicit add)
- Geo: FRDM (Emerging Markets), IDMO (International Developed), HFXI (International Developed), IVLU (International Developed)
## Analysis: collapse fine-grained NPORT-P sector strings at display time — priority MEDIUM
The Sector (Equities) section in `analysis` output currently
shows raw NPORT-P sector strings. For a portfolio with
multi-asset funds (FAGIX, VBTLX, PTY) this means six different
"Debt / *" rows (Debt / Corporate, Debt / US Treasury,
Debt / Municipal, Debt / Non-US Sovereign, Debt / US Gov Agency,
Debt / US GSE), three "Asset-Backed / *" rows, three
"Derivative / *" rows, etc. — too granular to scan.
The user's framing: "sometimes I'd be interested in 'roll up
all my debt investments to a single bucket', sometimes I'd want
to see split between federal government, munis and corporate."
That argues for **multiple display granularity levels** with a
TUI hot-key to toggle, not a one-time collapse decision.
### Design
Three display granularity tiers:
1. **Coarse** (4 buckets): Equity / Fixed Income / Cash / Other.
Already implemented as the Asset Category section. Could be
a granularity option for the Sector section too.
2. **Mid** (~12-16 buckets): collapses NPORT-P sub-flavors but
keeps GICS sectors distinct. Roughly:
- "Bonds" (collapses all `Debt / *` + `Loan / *`)
- "Asset-Backed Securities" (collapses all `Asset-Backed / *`)
- "Cash & Equivalents" (collapses STIV variants + Repurchase Agreement)
- "Equity / Corporate" (the dominant equity bucket)
- "Equity / Other" (small equity sleeves)
- "Derivatives & Other" (collapses Derivative / Derivative-FX / Direct Real Property / etc.)
- The 11 GICS sectors (Technology, Healthcare, etc.) for stock-level entries
3. **Fine** (current behavior): raw NPORT-P strings — Debt /
US Treasury vs Debt / Municipal vs Debt / Non-US Sovereign,
etc.
User toggles between tiers. Default: probably Mid.
### Implementation
Build a pure mapping function `collapseSector(sector, granularity)
[]const u8` parallel to `bucketSector`. Display layer chooses
granularity. Aggregation can either:
- **(a)** Run all three aggregations every time and pick at
display. Memory cost ~3x for the sector breakdown but the
data is small (dozens of rows).
- **(b)** Re-aggregate when granularity changes. Cheaper memory,
costs a single pass over the classifications on toggle. TUI
toggle latency is fine — it's a hashmap rebuild over <50 rows.
Option (b) is probably right for the TUI. CLI can pick one
granularity at command-line time (default Mid; `--sector-detail
fine|mid|coarse` to override).
### Dependency
**Lands AFTER the title-keyword inference work above**, so the
collapse logic is designed against the post-inference content
shape (where XLV is `Healthcare` rather than `Equity / Corporate`,
FRDM is `Equity / Corporate` + `geo::Emerging Markets`, etc.)
rather than today's pre-inference shape.
### TUI integration
Hot-key cycles between coarse / mid / fine on the analysis tab.
Status bar shows current granularity. State persists across
re-renders within a session; no need to persist across sessions.
### Tests
- `collapseSector("Debt / US Treasury", .mid)` → "Bonds"
- `collapseSector("Debt / US Treasury", .fine)` → "Debt / US Treasury"
- `collapseSector("Debt / US Treasury", .coarse)` → "Fixed Income" (delegates to bucketSector)
- `collapseSector("Technology", .mid)` → "Technology" (GICS sectors stay distinct at mid)
- `collapseSector("Technology", .coarse)` → "Equity"
- TUI hot-key cycles through three granularities and updates display.
## Analysis: umbrella-insurance exposure indicator — priority MEDIUM
In the `analysis` command and TUI tab, surface how much of the

View file

@ -414,6 +414,187 @@ pub const FundSector = struct {
pct: f64,
};
// Title-keyword inference
//
// When the EDGAR fallback fires (Wikidata had no entry), the
// fund's title pulled from NPORT-P `<seriesName>` or the
// company-tickers `title` often carries enough signal to
// infer a useful sector or geo override. The default
// `sector::Equity / Corporate, geo::US, asset_class::Fund`
// triple is mechanically correct (NPORT-P really says the fund
// holds equity in corporate issuers, US-domiciled fund) but
// loses the specific GICS sector or international/emerging-market
// exposure that actually matters for portfolio analysis.
//
// Both inference functions are pure-data and tested directly.
/// Returns true if `haystack` contains any of `needles` as a
/// substring. Case-sensitive caller pre-lowercases when
/// case-insensitive matching is wanted.
fn titleContainsAny(haystack: []const u8, needles: []const []const u8) bool {
for (needles) |needle| {
if (std.mem.indexOf(u8, haystack, needle) != null) return true;
}
return false;
}
/// Lowercase the title into a stack buffer for case-insensitive
/// keyword matching. Truncates titles longer than the buffer
/// (returns null) real fund names easily fit in 256 bytes.
fn lowercaseTitle(buf: []u8, title: []const u8) ?[]const u8 {
if (title.len > buf.len) return null;
return std.ascii.lowerString(buf[0..title.len], title);
}
/// Infer a GICS sector from a fund's title. Returns null when
/// no unambiguous keyword match caller falls back to whatever
/// sector data NPORT-P provided (typically `Equity / Corporate`).
///
/// Conservative keyword set: matches only words that map
/// unambiguously to a single GICS sector. "Income" / "Dividend"
/// / "Value" / "Growth" / "Momentum" / "Total" / "Equal Weight"
/// / "International" / "Emerging" don't appear here they
/// describe the screening methodology or geo, not the sector.
///
/// Reuses `zfin.classification.sector` constants so the
/// inference taxonomy stays in lock-step with the canonicalizer.
pub fn inferSectorFromTitle(title: ?[]const u8) ?[]const u8 {
const t = title orelse return null;
if (t.len == 0) return null;
var buf: [256]u8 = undefined;
const lc = lowercaseTitle(&buf, t) orelse return null;
// Order matters: more-specific keywords come first within
// each sector. "Health care" before "care" (irrelevant
// example), "semiconductor" before generic "tech" (which we
// don't include too broad).
// Healthcare. "Health care" with space (XLV title), "healthcare"
// (one word), "biotech", "pharmaceutical".
if (titleContainsAny(lc, &.{ "health care", "healthcare", "biotech", "pharmaceutical", "medical" })) {
return zfin.classification.sector.healthcare;
}
// Technology. Specific terms only "tech" alone is too
// broad (matches "biotech", "fintech", "edtech" all
// sector-mixing).
if (titleContainsAny(lc, &.{ "semiconductor", "software", "cloud computing", "internet" })) {
return zfin.classification.sector.technology;
}
// Financial Services. "Financial" is fairly specific in
// fund-name conventions ("Financial Select Sector SPDR",
// "Vanguard Financials ETF").
if (titleContainsAny(lc, &.{ "financial", "bank" })) {
return zfin.classification.sector.financial_services;
}
// Energy. "Energy" alone is mostly unambiguous in fund
// conventions; pair with "oil" / "gas" for redundancy.
if (titleContainsAny(lc, &.{ "energy", "oil & gas", "oil and gas", "petroleum" })) {
return zfin.classification.sector.energy;
}
// Real Estate / REITs.
if (titleContainsAny(lc, &.{ "real estate", "reit" })) {
return zfin.classification.sector.real_estate;
}
// Utilities. "Utilities" alone is unambiguous.
if (titleContainsAny(lc, &.{"utilities"})) {
return zfin.classification.sector.utilities;
}
// Consumer Discretionary / Cyclical. Match the explicit
// labels "consumer" alone is ambiguous (could be
// discretionary or staples).
if (titleContainsAny(lc, &.{ "consumer discretionary", "consumer cyclical" })) {
return zfin.classification.sector.consumer_cyclical;
}
// Consumer Staples / Defensive.
if (titleContainsAny(lc, &.{ "consumer staples", "consumer defensive" })) {
return zfin.classification.sector.consumer_defensive;
}
// Industrials. "Industrial" is more reliable than
// "industrials" because some fund names use the singular
// ("Industrial Select Sector SPDR").
if (titleContainsAny(lc, &.{ "industrial", "aerospace", "defense" })) {
return zfin.classification.sector.industrials;
}
// Basic Materials.
if (titleContainsAny(lc, &.{ "materials", "mining", "miners", "metals" })) {
return zfin.classification.sector.basic_materials;
}
// Communication Services. "Communication" / "Telecom"
// unambiguous.
if (titleContainsAny(lc, &.{ "communication", "telecom", "media" })) {
return zfin.classification.sector.communication_services;
}
return null;
}
/// Infer a geo bucket from a fund's title. Returns null when
/// the title doesn't carry an unambiguous international/emerging
/// keyword caller keeps the default `geo::US`.
///
/// More important than sector inference: the default `geo::US`
/// is *factually wrong* for international funds (FRDM holds
/// emerging-market equities, not US), so this fix tightens
/// portfolio-level geographic-exposure reporting.
pub fn inferGeoFromTitle(title: ?[]const u8) ?[]const u8 {
const t = title orelse return null;
if (t.len == 0) return null;
var buf: [256]u8 = undefined;
const lc = lowercaseTitle(&buf, t) orelse return null;
// Emerging markets first most specific. "Emerging" alone
// is rare in non-EM contexts in fund-name conventions.
// "Frontier" likewise is conventionally only used for
// frontier markets in fund titles.
if (titleContainsAny(lc, &.{ "emerging market", "emerging markets", "frontier market", "frontier markets", "frontier" })) {
return zfin.classification.geo.emerging;
}
// International Developed. "International" / "Intl" /
// "ex-US" / "World ex US" / "Developed Markets" /
// specific developed-market regions.
//
// False-positive risk: a hypothetical "Vanguard Total
// International + US Equity Fund" would mis-tag here. None
// of the user's current portfolio holds such a hybrid
// fund. If one ever shows up, it'll get flagged in the
// diff-against-old-metadata.srf review and can be
// hand-corrected.
if (titleContainsAny(lc, &.{ "international", " intl", "ex-us", "ex us", "world ex", "developed market", "developed markets" })) {
return zfin.classification.geo.developed;
}
return null;
}
/// Determine whether a fund's NPORT-P breakdown is dominated
/// by a single Equity / Corporate sector the precondition
/// for sector inference firing. A "dominant" sector is one
/// that's >95% of the holdings; multi-asset funds (FAGIX-shape:
/// 48% Debt + 22% Equity + ...) don't meet this guard and
/// keep their NPORT-P decomposition.
fn hasDominantEquitySector(fund_sectors: ?[]const FundSector) bool {
const sectors = fund_sectors orelse return false;
for (sectors) |s| {
if (std.mem.eql(u8, s.description, "Equity / Corporate") and s.pct >= 95.0) {
return true;
}
}
return false;
}
/// Wikidata didn't return a classification for `sym` (either the
/// fetch errored out softly, or returned an empty result set).
/// Emit a metadata line based on the EDGAR-fallback `lookup`:
@ -448,7 +629,11 @@ fn emitMissingClassification(
} else {
try out.print("# {s} -- EDGAR managed fund (Wikidata had no entry)\n", .{sym});
}
try emitFundLines(sym, "Fund", fund_sectors, out);
// Title-keyword inference: try the series_name (which
// is the only title source on the managed-fund path).
const inferred_sector = inferSectorFromTitle(series_name);
const inferred_geo = inferGeoFromTitle(series_name);
try emitFundLines(sym, "Fund", fund_sectors, inferred_sector, inferred_geo, out);
},
.company_or_uit => |c| {
const asset_class = if (c.is_etf) "ETF" else "Fund";
@ -463,7 +648,12 @@ fn emitMissingClassification(
} else {
try out.print("# {s} -- EDGAR company-map entry (Wikidata had no entry)\n", .{sym});
}
try emitFundLines(sym, asset_class, fund_sectors, out);
// Title-keyword inference: prefer series_name (more
// authoritative), fall back to company-tickers title.
const effective_title: ?[]const u8 = series_name orelse c.title;
const inferred_sector = inferSectorFromTitle(effective_title);
const inferred_geo = inferGeoFromTitle(effective_title);
try emitFundLines(sym, asset_class, fund_sectors, inferred_sector, inferred_geo, out);
},
.none => {
if (err) |e| {
@ -481,24 +671,58 @@ fn emitMissingClassification(
/// `pct:num:N` line per sector; otherwise emits a single
/// `sector::TODO` line. The asset_class comes from the caller
/// (already determined: "Fund" or "ETF").
///
/// `inferred_sector` (when non-null AND a single dominant
/// `Equity / Corporate` sleeve exists) replaces that sleeve's
/// row with a GICS-tagged row, preserving the original pct.
/// Other rows (Cash sleeves, dust derivatives) stay as-is.
/// When inference doesn't apply (no dominant sleeve, no
/// inferred sector, or the breakdown is multi-asset like
/// FAGIX), the NPORT-P decomposition emits unchanged.
///
/// `geo` is applied uniformly to every emitted row. Defaults
/// to `"US"` when null. NPORT-P doesn't tell us the holdings'
/// geo (only the fund's domicile, which is always US for funds
/// in this map), so the caller passes the inferred geo when
/// available.
fn emitFundLines(
sym: []const u8,
asset_class: []const u8,
fund_sectors: ?[]const FundSector,
inferred_sector: ?[]const u8,
geo: ?[]const u8,
out: *std.Io.Writer,
) !void {
const geo_str = geo orelse "US";
if (fund_sectors) |sectors| {
if (sectors.len > 0) {
const should_override =
inferred_sector != null and
hasDominantEquitySector(sectors);
for (sectors) |s| {
// When inference fires, replace the dominant
// Equity / Corporate row with the inferred GICS
// sector. Other rows stay as the raw NPORT-P
// category they're informative as-is (Cash
// sleeves, derivatives, etc.).
const sector_str = if (should_override and
std.mem.eql(u8, s.description, "Equity / Corporate"))
inferred_sector.?
else
s.description;
try out.print(
"symbol::{s},sector::{s},geo::US,asset_class::{s},pct:num:{d:.2}\n",
.{ sym, s.description, asset_class, s.pct },
"symbol::{s},sector::{s},geo::{s},asset_class::{s},pct:num:{d:.2}\n",
.{ sym, sector_str, geo_str, asset_class, s.pct },
);
}
return;
}
}
try out.print("symbol::{s},sector::TODO,geo::US,asset_class::{s}\n", .{ sym, asset_class });
// No sector breakdown at all (NPORT-P fetch failed). Emit
// one TODO line but if title-keyword inference returned
// a sector, use it instead of "TODO".
const sector_str = inferred_sector orelse "TODO";
try out.print("symbol::{s},sector::{s},geo::{s},asset_class::{s}\n", .{ sym, sector_str, geo_str, asset_class });
}
/// What `getEtfMetrics` provides that `enrich` actually uses:
@ -1565,10 +1789,361 @@ test "emitMissingClassification: .none ignores series_name (no fund name to disp
try std.testing.expect(std.mem.indexOf(u8, written, "Spurious Name") == null);
}
// inferSectorFromTitle
test "inferSectorFromTitle: Health Care / Healthcare → Healthcare" {
try std.testing.expectEqualStrings(
"Healthcare",
inferSectorFromTitle("State Street(R) Health Care Select Sector SPDR(R) ETF").?,
);
try std.testing.expectEqualStrings("Healthcare", inferSectorFromTitle("iShares U.S. Healthcare ETF").?);
try std.testing.expectEqualStrings("Healthcare", inferSectorFromTitle("Vanguard Health Care ETF").?);
try std.testing.expectEqualStrings("Healthcare", inferSectorFromTitle("SPDR S&P Pharmaceutical ETF").?);
try std.testing.expectEqualStrings("Healthcare", inferSectorFromTitle("iShares Biotech ETF").?);
}
test "inferSectorFromTitle: Semiconductor / Software → Technology" {
try std.testing.expectEqualStrings("Technology", inferSectorFromTitle("iShares Semiconductor ETF").?);
try std.testing.expectEqualStrings("Technology", inferSectorFromTitle("VanEck Semiconductor ETF").?);
try std.testing.expectEqualStrings("Technology", inferSectorFromTitle("Invesco Software ETF").?);
try std.testing.expectEqualStrings("Technology", inferSectorFromTitle("First Trust Cloud Computing ETF").?);
}
test "inferSectorFromTitle: Financial → Financial Services" {
try std.testing.expectEqualStrings("Financial Services", inferSectorFromTitle("Financial Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Financial Services", inferSectorFromTitle("Vanguard Financials ETF").?);
try std.testing.expectEqualStrings("Financial Services", inferSectorFromTitle("SPDR S&P Bank ETF").?);
}
test "inferSectorFromTitle: Energy → Energy" {
try std.testing.expectEqualStrings("Energy", inferSectorFromTitle("Energy Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Energy", inferSectorFromTitle("SPDR S&P Oil & Gas Exploration ETF").?);
try std.testing.expectEqualStrings("Energy", inferSectorFromTitle("VanEck Oil and Gas ETF").?);
try std.testing.expectEqualStrings("Energy", inferSectorFromTitle("Invesco Petroleum ETF").?);
}
test "inferSectorFromTitle: Real Estate / REIT → Real Estate" {
try std.testing.expectEqualStrings("Real Estate", inferSectorFromTitle("Vanguard Real Estate ETF").?);
try std.testing.expectEqualStrings("Real Estate", inferSectorFromTitle("Schwab U.S. REIT ETF").?);
}
test "inferSectorFromTitle: Utilities → Utilities" {
try std.testing.expectEqualStrings("Utilities", inferSectorFromTitle("Utilities Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Utilities", inferSectorFromTitle("Vanguard Utilities ETF").?);
}
test "inferSectorFromTitle: Industrials and Materials" {
try std.testing.expectEqualStrings("Industrials", inferSectorFromTitle("Industrial Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Industrials", inferSectorFromTitle("iShares Aerospace & Defense ETF").?);
try std.testing.expectEqualStrings("Basic Materials", inferSectorFromTitle("Materials Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Basic Materials", inferSectorFromTitle("VanEck Gold Miners ETF").?);
}
test "inferSectorFromTitle: Communication Services" {
try std.testing.expectEqualStrings("Communication Services", inferSectorFromTitle("Communication Services Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Communication Services", inferSectorFromTitle("iShares U.S. Telecom ETF").?);
}
test "inferSectorFromTitle: Consumer Discretionary / Cyclical / Staples / Defensive" {
try std.testing.expectEqualStrings("Consumer Cyclical", inferSectorFromTitle("Consumer Discretionary Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Consumer Cyclical", inferSectorFromTitle("iShares U.S. Consumer Cyclical ETF").?);
try std.testing.expectEqualStrings("Consumer Defensive", inferSectorFromTitle("Consumer Staples Select Sector SPDR Fund").?);
try std.testing.expectEqualStrings("Consumer Defensive", inferSectorFromTitle("Vanguard Consumer Defensive ETF").?);
}
test "inferSectorFromTitle: broad-market / strategy funds return null" {
// No sector keyword falls through. Caller keeps the
// NPORT-P generic Equity / Corporate.
try std.testing.expect(inferSectorFromTitle("SPDR S&P 500 ETF Trust") == null);
try std.testing.expect(inferSectorFromTitle("Vanguard Total Stock Market ETF") == null);
try std.testing.expect(inferSectorFromTitle("Schwab U.S. Dividend Equity ETF") == null);
try std.testing.expect(inferSectorFromTitle("Invesco S&P 500 Equal Weight ETF") == null);
try std.testing.expect(inferSectorFromTitle("Vanguard Total Bond Market Index Fund") == null);
try std.testing.expect(inferSectorFromTitle("Fidelity Capital and Income Fund") == null);
}
test "inferSectorFromTitle: target-date funds return null (multi-sector)" {
// Target-date funds hold a mix of equity and bonds across
// multiple sectors. No keyword should match.
try std.testing.expect(inferSectorFromTitle("VANGUARD TARGET RETIREMENT 2040 FUND") == null);
try std.testing.expect(inferSectorFromTitle("Fidelity Freedom 2050 Fund") == null);
}
test "inferSectorFromTitle: null and empty input return null" {
try std.testing.expect(inferSectorFromTitle(null) == null);
try std.testing.expect(inferSectorFromTitle("") == null);
}
test "inferSectorFromTitle: oversized title returns null safely" {
// Buffer-bounded; titles >256 bytes return null rather
// than crash. Real fund names are far shorter; this is a
// defensive check.
const long_title = "X" ** 300;
try std.testing.expect(inferSectorFromTitle(long_title) == null);
}
test "inferSectorFromTitle: case-insensitive matching" {
// "HEALTH CARE" (all caps, e.g. Vanguard's all-caps style)
// matches the same as "Health Care".
try std.testing.expectEqualStrings("Healthcare", inferSectorFromTitle("VANGUARD HEALTH CARE ETF").?);
try std.testing.expectEqualStrings("Technology", inferSectorFromTitle("ISHARES SEMICONDUCTOR ETF").?);
}
test "inferSectorFromTitle: returns same pointer for same bucket (static literal)" {
// The taxonomy constants are static literals; multiple
// calls returning the same bucket should hand out the
// same byte pointer. Lets callers compare via ptr equality
// and use the result as a stable HashMap key.
const a = inferSectorFromTitle("Vanguard Health Care ETF").?;
const b = inferSectorFromTitle("iShares U.S. Healthcare ETF").?;
try std.testing.expectEqual(@intFromPtr(a.ptr), @intFromPtr(b.ptr));
}
// inferGeoFromTitle
test "inferGeoFromTitle: Emerging Markets → Emerging Markets" {
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("Freedom 100 Emerging Markets ETF").?);
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("Vanguard FTSE Emerging Markets ETF").?);
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("iShares MSCI Emerging Markets ETF").?);
}
test "inferGeoFromTitle: Frontier Markets → Emerging Markets bucket" {
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("iShares MSCI Frontier 100 ETF").?);
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("Vanguard Frontier Markets ETF").?);
}
test "inferGeoFromTitle: International → International Developed" {
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("iShares MSCI Intl Value Factor ETF").?);
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("Vanguard FTSE Developed Markets ETF").?);
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("Invesco S&P International Developed Momentum ETF").?);
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("NYLI FTSE International Equity Currency Neutral ETF").?);
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("Vanguard FTSE All-World ex-US ETF").?);
}
test "inferGeoFromTitle: US-only / no geo keyword returns null" {
try std.testing.expect(inferGeoFromTitle("SPDR S&P 500 ETF Trust") == null);
try std.testing.expect(inferGeoFromTitle("Vanguard Total Stock Market ETF") == null);
try std.testing.expect(inferGeoFromTitle("Schwab U.S. Dividend Equity ETF") == null);
try std.testing.expect(inferGeoFromTitle("Fidelity Capital and Income Fund") == null);
}
test "inferGeoFromTitle: Emerging beats International when both present" {
// Defensive: "iShares MSCI International Emerging Markets
// ETF" (hypothetical) would match both branches. Emerging
// Markets is more specific and is checked first; verify
// the priority order holds.
try std.testing.expectEqualStrings(
"Emerging Markets",
inferGeoFromTitle("iShares MSCI International Emerging Markets ETF").?,
);
}
test "inferGeoFromTitle: null and empty input return null" {
try std.testing.expect(inferGeoFromTitle(null) == null);
try std.testing.expect(inferGeoFromTitle("") == null);
}
test "inferGeoFromTitle: case-insensitive matching" {
try std.testing.expectEqualStrings("Emerging Markets", inferGeoFromTitle("FREEDOM 100 EMERGING MARKETS ETF").?);
try std.testing.expectEqualStrings("International Developed", inferGeoFromTitle("ISHARES MSCI INTL VALUE FACTOR ETF").?);
}
test "inferGeoFromTitle: returns static-literal pointers" {
const a = inferGeoFromTitle("iShares Emerging Markets ETF").?;
const b = inferGeoFromTitle("Vanguard Emerging Markets ETF").?;
try std.testing.expectEqual(@intFromPtr(a.ptr), @intFromPtr(b.ptr));
}
// hasDominantEquitySector
test "hasDominantEquitySector: single 99% Equity / Corporate -> true" {
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.7 },
.{ .description = "Short-Term Investment Vehicle / Registered Fund", .pct = 0.19 },
};
try std.testing.expect(hasDominantEquitySector(sectors[0..]));
}
test "hasDominantEquitySector: 95% threshold is inclusive" {
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 95.0 },
};
try std.testing.expect(hasDominantEquitySector(sectors[0..]));
}
test "hasDominantEquitySector: 94.99% does NOT trigger" {
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 94.99 },
};
try std.testing.expect(!hasDominantEquitySector(sectors[0..]));
}
test "hasDominantEquitySector: multi-asset fund (FAGIX-shape) -> false" {
// FAGIX has 22% Equity / Corporate plus debt and other
// sleeves. 22% is way under the 95% threshold.
const sectors = [_]FundSector{
.{ .description = "Debt / Corporate", .pct = 47.69 },
.{ .description = "Equity / Corporate", .pct = 22.49 },
.{ .description = "Loan / Corporate", .pct = 9.99 },
};
try std.testing.expect(!hasDominantEquitySector(sectors[0..]));
}
test "hasDominantEquitySector: pure-debt fund -> false" {
// VBTLX-shape: all Debt / *. No Equity / Corporate row at all.
const sectors = [_]FundSector{
.{ .description = "Debt / Corporate", .pct = 50.0 },
.{ .description = "Debt / US Treasury", .pct = 30.0 },
};
try std.testing.expect(!hasDominantEquitySector(sectors[0..]));
}
test "hasDominantEquitySector: null and empty -> false" {
try std.testing.expect(!hasDominantEquitySector(null));
const empty = [_]FundSector{};
try std.testing.expect(!hasDominantEquitySector(empty[0..]));
}
// inference integration with emitMissingClassification
test "emitMissingClassification: XLV-shape applies sector inference (Health Care -> Healthcare)" {
// Single dominant Equity / Corporate (99.76%) AND title
// contains "Health Care" the Equity row gets replaced
// with Healthcare. Cash sleeve stays as STIV.
var out_buf: [1024]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.76 },
.{ .description = "Short-Term Investment Vehicle / Registered Fund", .pct = 0.45 },
};
try emitMissingClassification(
"XLV",
.{ .company_or_uit = .{ .title = "SPDR HEALTH CARE SELECT SECTOR ETF", .is_etf = true } },
sectors[0..],
"State Street(R) Health Care Select Sector SPDR(R) ETF",
null,
&out,
);
const written = out.buffered();
// The dominant row gets the GICS sector. Note pct preserved.
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::XLV,sector::Healthcare,geo::US,asset_class::ETF,pct:num:99.76") != null);
// Cash sleeve unchanged.
try std.testing.expect(std.mem.indexOf(u8, written, "Short-Term Investment Vehicle / Registered Fund") != null);
// The generic Equity / Corporate row should NOT appear.
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::XLV,sector::Equity / Corporate") == null);
}
test "emitMissingClassification: FRDM-shape applies geo inference (Emerging Markets)" {
// Title "Emerging Markets" every row gets geo::Emerging Markets.
// No sector inference (no sector keyword in title).
var out_buf: [1024]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.55 },
};
try emitMissingClassification(
"FRDM",
.managed_fund,
sectors[0..],
"Freedom 100 Emerging Markets ETF",
null,
&out,
);
const written = out.buffered();
try std.testing.expect(std.mem.indexOf(u8, written, "geo::Emerging Markets") != null);
// No US geo on this row.
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FRDM,sector::Equity / Corporate,geo::US") == null);
}
test "emitMissingClassification: multi-asset fund (FAGIX-shape) does NOT apply sector inference" {
// Multi-asset breakdown no dominant Equity / Corporate
// sleeve. Sector inference should NOT fire even if the
// title had a sector keyword (FAGIX's title doesn't, but
// this guards the multi-asset case generally).
var out_buf: [1024]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const sectors = [_]FundSector{
.{ .description = "Debt / Corporate", .pct = 47.69 },
.{ .description = "Equity / Corporate", .pct = 22.49 },
.{ .description = "Loan / Corporate", .pct = 9.99 },
};
try emitMissingClassification(
"FAKE",
.managed_fund,
sectors[0..],
// Hypothetical title with a sector keyword the
// inference would normally pick up.
"Hypothetical Healthcare Multi-Asset Fund",
null,
&out,
);
const written = out.buffered();
// Inference should NOT have fired the Equity / Corporate
// row stays raw. The fund's *comment line* contains the
// word "Healthcare" because the title does, but no
// `sector::Healthcare` row should appear.
try std.testing.expect(std.mem.indexOf(u8, written, "sector::Healthcare") == null);
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FAKE,sector::Equity / Corporate") != null);
// Other sleeves unchanged.
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FAKE,sector::Debt / Corporate") != null);
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FAKE,sector::Loan / Corporate") != null);
}
test "emitMissingClassification: SCHD-shape (no sector keyword) keeps NPORT-P breakdown" {
// SCHD has dominant Equity / Corporate but no sector
// keyword in its title inference returns null and the
// raw NPORT-P row stays. (User can hand-edit if they want
// a different label.)
var out_buf: [1024]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.70 },
.{ .description = "Short-Term Investment Vehicle / Registered Fund", .pct = 0.19 },
};
try emitMissingClassification(
"SCHD",
.managed_fund,
sectors[0..],
"Schwab U.S. Dividend Equity ETF",
null,
&out,
);
const written = out.buffered();
// No sector inference: keep the raw row.
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::SCHD,sector::Equity / Corporate") != null);
}
test "emitMissingClassification: combined sector + geo inference" {
// Hypothetical "iShares MSCI Healthcare Emerging Markets
// ETF" — both keywords fire. Healthcare overrides the
// Equity / Corporate row; Emerging Markets overrides the
// geo on every row.
var out_buf: [1024]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.0 },
};
try emitMissingClassification(
"FAKE",
.managed_fund,
sectors[0..],
"iShares MSCI Healthcare Emerging Markets ETF",
null,
&out,
);
const written = out.buffered();
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FAKE,sector::Healthcare,geo::Emerging Markets,asset_class::Fund,pct:num:99.00") != null);
}
test "emitFundLines: null sectors -> single TODO line" {
var out_buf: [256]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
try emitFundLines("VTI", "ETF", null, &out);
try emitFundLines("VTI", "ETF", null, null, null, &out);
try std.testing.expectEqualStrings(
"symbol::VTI,sector::TODO,geo::US,asset_class::ETF\n",
out.buffered(),
@ -1582,7 +2157,7 @@ test "emitFundLines: populated sectors -> one line per sector with pct" {
.{ .description = "Debt / Corporate", .pct = 47.69 },
.{ .description = "Equity / Corporate", .pct = 22.49 },
};
try emitFundLines("FAGIX", "Fund", sectors[0..], &out);
try emitFundLines("FAGIX", "Fund", sectors[0..], null, null, &out);
const written = out.buffered();
try std.testing.expect(std.mem.indexOf(u8, written, "symbol::FAGIX,sector::Debt / Corporate,geo::US,asset_class::Fund,pct:num:47.69") != null);
@ -1594,7 +2169,7 @@ test "emitFundLines: empty slice -> single TODO line (treats empty as null)" {
var out_buf: [256]u8 = undefined;
var out: std.Io.Writer = .fixed(&out_buf);
const empty: [0]FundSector = .{};
try emitFundLines("VTI", "ETF", empty[0..], &out);
try emitFundLines("VTI", "ETF", empty[0..], null, null, &out);
try std.testing.expectEqualStrings(
"symbol::VTI,sector::TODO,geo::US,asset_class::ETF\n",
out.buffered(),
@ -1610,7 +2185,7 @@ test "emitFundLines: negative pct values render correctly" {
.{ .description = "Repurchase Agreement / Other", .pct = -29.72 },
.{ .description = "Derivative-FX / Other", .pct = -0.84 },
};
try emitFundLines("PTY", "Fund", sectors[0..], &out);
try emitFundLines("PTY", "Fund", sectors[0..], null, null, &out);
const written = out.buffered();
try std.testing.expect(std.mem.indexOf(u8, written, "pct:num:-29.72") != null);
@ -1623,7 +2198,7 @@ test "emitFundLines: ETF asset_class flows through" {
const sectors = [_]FundSector{
.{ .description = "Equity / Corporate", .pct = 99.86 },
};
try emitFundLines("SOXX", "ETF", sectors[0..], &out);
try emitFundLines("SOXX", "ETF", sectors[0..], null, null, &out);
try std.testing.expectEqualStrings(
"symbol::SOXX,sector::Equity / Corporate,geo::US,asset_class::ETF,pct:num:99.86\n",
out.buffered(),

View file

@ -175,6 +175,30 @@ pub const geo = struct {
pub const unknown = "Unknown";
};
// Sector taxonomy
/// Canonical sector taxonomy (GICS-aligned 11-sector model).
/// Producers (Wikidata's `canonicalizeSector`, enrich's
/// `inferSectorFromTitle`) emit one of these strings; consumers
/// (analysis bucketing, display) compare against them.
///
/// Lives here (not in any provider) so multiple producers can
/// share one taxonomy. Adding a 12th sector or renaming an
/// existing one is a one-place edit.
pub const sector = struct {
pub const technology = "Technology";
pub const communication_services = "Communication Services";
pub const consumer_cyclical = "Consumer Cyclical";
pub const consumer_defensive = "Consumer Defensive";
pub const healthcare = "Healthcare";
pub const financial_services = "Financial Services";
pub const energy = "Energy";
pub const industrials = "Industrials";
pub const basic_materials = "Basic Materials";
pub const real_estate = "Real Estate";
pub const utilities = "Utilities";
};
/// Country-code-to-geo-bucket lookup. Producers (Wikidata today,
/// others tomorrow) hand us ISO-3166 alpha-2 codes via the
/// `ClassificationRecord.country` field; we map them to the geo