FacilityMatcher treated the descriptors «شبانه روزی» (24-hour), «خیریه» (charity), «دولتی»
(state-run) and «خصوصی» (private) as part of a facility's name, so a real facility merged
into a generic one whenever they shared a descriptor: «درمانگاه شبانهروزی اسفند» collapsed
into the existing «پلی کلینیک شبانه روزی», losing «اسفند». Add these descriptors to the
stripped type-words so matching compares the distinctive core («اسفند») instead. Side
benefit: bare descriptor-only names («پلی کلینیک شبانه روزی») are now detected as junk and
get folded into the placeholder by the cleanup, rather than masquerading as a real facility.
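
A minimal Python sketch of the descriptor stripping described above (the project itself is not Python). The word lists, `_norm`, and `core_name` are hypothetical stand-ins for illustration, not FacilityMatcher's actual lists or API.

```python
# Sketch only: hypothetical word lists, not FacilityMatcher's real ones.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def _norm(text: str) -> str:
    """Unify Arabic yeh/kaf with the Persian forms and turn ZWNJ into a space."""
    return text.translate(ARABIC_TO_PERSIAN).replace("\u200c", " ")


# Facility type words that were already stripped before this change (assumed).
TYPE_WORDS = {_norm(w) for w in ("بیمارستان", "درمانگاه", "کلینیک", "پلی")}

# Descriptors newly added to the stripped set; both the split and joined
# spellings of the 24-hour descriptor are listed.
DESCRIPTORS = {
    _norm(w)
    for w in ("شبانه", "روزی", "شبانهروزی", "خیریه", "دولتی", "خصوصی")
}


def core_name(name: str, strip_descriptors: bool = True) -> str:
    """Return the distinctive part of a facility name.

    With strip_descriptors=False this mimics the old behaviour, where the
    descriptor survives and two unrelated facilities can share a core.
    """
    stripped = (TYPE_WORDS | DESCRIPTORS) if strip_descriptors else TYPE_WORDS
    return " ".join(t for t in _norm(name).split() if t not in stripped)
```

With the old behaviour (`strip_descriptors=False`) both example names keep «شبانه روزی» in their core, which is what let them match; with descriptors stripped, one reduces to «اسفند» and the other to an empty string.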

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Cleans up the crawl-generated facility table, which surfaced garbage entries on /Facilities
(«بیمارستان هستم», «... از مدجابز», bare «کلینیک», and «سازمان برنامه جنوبی» three times):
- FacilityMatcher.IsJunkName: shared detector for non-names (bare type words, cores
made only of filler/verb tokens, and leaked crawl-source/placeholder text). Also adds
داروخانه/آسایشگاه to the generic type words so bare occurrences are caught and
duplicates merge more reliably.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so
new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities
into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing
their shifts/jobs first. Hard guard: only purely crawl-generated, unmanaged facilities
are removed — employer-owned and verified facilities are never touched.
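
The junk check and the hard guard above can be sketched roughly as follows, in Python for illustration only. The token lists, the `Facility` fields, and the function names are assumptions; the real IsJunkName and MergeAndCleanFacilitiesAsync are not shown in this commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical lists; the real detector's vocab is not part of this message.
GENERIC_TYPE_WORDS = {"بیمارستان", "درمانگاه", "کلینیک", "داروخانه", "آسایشگاه"}
FILLER_TOKENS = {"هستم", "از", "در"}   # assumed filler/verb tokens
SOURCE_MARKERS = ("مدجابز",)            # assumed leaked crawl-source text


def is_junk_name(name: str) -> bool:
    """True for bare type words, filler-only cores, or leaked source text."""
    if any(marker in name for marker in SOURCE_MARKERS):
        return True
    tokens = name.replace("\u200c", " ").split()
    core = [t for t in tokens if t not in GENERIC_TYPE_WORDS]
    # An empty core (bare type word) also counts as junk: all([]) is True.
    return all(t in FILLER_TOKENS for t in core)


@dataclass
class Facility:
    id: int
    name: str
    crawl_generated: bool
    owner_id: Optional[int] = None  # employer managing the facility, if any
    verified: bool = False


def is_removable(f: Facility) -> bool:
    """Hard guard: only unmanaged, unverified, crawl-generated rows."""
    return f.crawl_generated and f.owner_id is None and not f.verified


def plan_junk_fold(facilities: List[Facility], placeholder_id: int) -> Dict[int, int]:
    """Map each removable junk facility id to the placeholder id.

    The caller would repoint shifts/jobs using this map before deleting rows.
    """
    return {
        f.id: placeholder_id
        for f in facilities
        if f.id != placeholder_id and is_removable(f) and is_junk_name(f.name)
    }
```

Keeping the guard in one predicate means a junk-looking name on an employer-owned or verified facility is reported by `is_junk_name` but never ends up in the fold plan.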

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

When publishing a scraped listing, we now look for an existing facility that
is an exact or close match, and create a new one only when there is none.
This avoids duplicates like «بیمارستان میلاد» vs «میلاد».
- ListingParser: extract a facility name (keyword + distinctive words) from
the post and surface it in the parser notes.
- FacilityMatcher: Persian-aware normalization (ي/ك, ZWNJ, punctuation),
type-word stripping for a "core" name, contains + Levenshtein similarity,
and FindBest (same-city exact → any-city exact → same-city fuzzy → any-city fuzzy).
- Review (manual publish): auto-select a matching facility or prefill the
new-facility name; resolve-or-create uses fuzzy match; dropdown preselects.
- IngestionService (auto-publish): reuse FacilityMatcher against a run-wide
facility list (grows as new ones are created) instead of exact-name only.
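
A rough Python sketch of the matching pipeline listed above: normalization, core extraction, contains + Levenshtein similarity, and the four FindBest tiers. The type-word list, the 0.9 score for containment, the 0.8 threshold, and comparing cores for the "exact" tiers are all assumptions made for illustration; the actual FacilityMatcher values are not given here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


def normalize(text: str) -> str:
    """Persian-aware cleanup: Arabic yeh/kaf, ZWNJ, punctuation, spacing."""
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    text = text.replace("\u200c", " ")
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation such as « »
    return " ".join(text.split())


# Hypothetical type words, normalized so lookups are consistent.
TYPE_WORDS = {normalize(w) for w in ("بیمارستان", "درمانگاه", "کلینیک")}


def core(name: str) -> str:
    """Normalized name with the generic type words removed."""
    return " ".join(t for t in normalize(name).split() if t not in TYPE_WORDS)


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; containment gets a fixed (assumed) score of 0.9."""
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    if a in b or b in a:
        return 0.9
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


@dataclass
class Facility:
    name: str
    city: str


def find_best(name: str, city: str, facilities: List[Facility],
              threshold: float = 0.8) -> Optional[Facility]:
    """Same-city exact, any-city exact, same-city fuzzy, any-city fuzzy."""
    target = core(name)
    if not target:
        return None  # nothing distinctive to match on
    same_city = [f for f in facilities if f.city == city]
    for pool in (same_city, facilities):
        for f in pool:
            if core(f.name) == target:
                return f
    for pool in (same_city, facilities):
        scored = [(similarity(core(f.name), target), f) for f in pool]
        scored = [s for s in scored if s[0] >= threshold]
        if scored:
            return max(scored, key=lambda s: s[0])[0 + 1]
    return None
```

For auto-publish, the caller would append each newly created facility to the same `facilities` list during a run, so later listings in that run can match it instead of creating another duplicate.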

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>