Commit Graph

3 Commits

Author SHA1 Message Date
soroush.asadi 3e65c88765 Strip generic facility descriptors so distinctive names don't false-merge
CI/CD / CI · dotnet build (push) Successful in 49s
CI/CD / Deploy · hamkadr (push) Successful in 1m1s
FacilityMatcher treated «شبانه روزی» (24-hour), «خیریه» (charity), «دولتی» (public) and
«خصوصی» (private) as part of a name, so a real facility merged into a generic one when they
shared a descriptor: «درمانگاه شبانه‌روزی اسفند» collapsed into the existing
«پلی کلینیک شبانه روزی», losing «اسفند». Add these descriptors to the stripped type-words so
matching compares the distinctive core («اسفند») instead. Side benefit: bare descriptor-only
names («پلی کلینیک شبانه روزی») are now classified as junk and get folded into the placeholder
by the cleanup, rather than masquerading as a real facility.
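
A minimal sketch of the descriptor stripping described above, written in Python for illustration only and not taken from the project's code. The word lists and the helper names `normalize` and `core_name` are assumptions; the real FacilityMatcher lists are not shown in this log.

```python
import re

# Hypothetical word lists for illustration; the real lists are not shown in the log.
TYPE_WORDS = {"درمانگاه", "پلی کلینیک", "کلینیک", "بیمارستان"}
DESCRIPTORS = {"شبانه روزی", "خیریه", "دولتی", "خصوصی"}


def normalize(name: str) -> str:
    """Treat ZWNJ as a space and collapse runs of whitespace."""
    name = name.replace("\u200c", " ")
    return re.sub(r"\s+", " ", name).strip()


def core_name(name: str) -> str:
    """Strip type words and descriptors, leaving the distinctive core."""
    text = normalize(name)
    # Strip longer phrases first so a multi-word phrase is removed as a unit.
    for phrase in sorted(TYPE_WORDS | DESCRIPTORS, key=len, reverse=True):
        # Match the phrase only as whole whitespace-delimited words.
        text = re.sub(rf"(?<!\S){re.escape(phrase)}(?!\S)", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# The two names from the commit message no longer share a core:
clinic_core = core_name("درمانگاه شبانه\u200cروزی اسفند")   # distinctive core survives
generic_core = core_name("پلی کلینیک شبانه روزی")            # nothing distinctive left
```

Under these assumed lists the first name keeps «اسفند» as its core while the second reduces to an empty string, which is the condition a junk detector can test for.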

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 14:00:00 +03:30
soroush.asadi 88eca92333 Facility data hygiene: merge duplicates, drop junk-named facilities
CI/CD / CI · dotnet build (push) Successful in 1m51s
CI/CD / Deploy · hamkadr (push) Successful in 2m17s
Cleans up the crawl-generated facility table that surfaced garbage on /Facilities
(«بیمارستان هستم», «... از مدجابز», bare «کلینیک», «سازمان برنامه جنوبی» x3):

- FacilityMatcher.IsJunkName: shared detector for non-names, i.e. bare type words, cores
  made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added
  داروخانه/آسایشگاه (pharmacy/nursing home) to the generic type words so bare occurrences
  are caught and names that contain them dedupe better.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so
  new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities
  into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing
  their shifts/jobs first. Hard guard: only purely crawl-generated, unmanaged facilities
  are removed — employer-owned and verified facilities are never touched.
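
The junk check and the hard guard above could look roughly like the following Python sketch. It is illustrative only: the token lists, the `Facility` fields (`crawl_generated`, `owner_id`, `verified`) and the function names are assumptions, not the project's actual schema or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical token lists, seeded from the examples in the commit message.
TYPE_WORDS = {"بیمارستان", "کلینیک", "درمانگاه", "داروخانه", "آسایشگاه"}
FILLER_TOKENS = {"هستم", "از"}
SOURCE_MARKERS = {"مدجابز"}


def is_junk_name(name: str) -> bool:
    """True for bare type words, filler-only cores, and leaked source text."""
    tokens = name.replace("\u200c", " ").split()
    if any(marker in tokens for marker in SOURCE_MARKERS):
        return True
    core = [t for t in tokens if t not in TYPE_WORDS]
    # all() over an empty core is True, so a bare type word counts as junk.
    return all(t in FILLER_TOKENS for t in core)


@dataclass
class Facility:
    id: int
    name: str
    crawl_generated: bool            # assumed flag: created by the crawler
    owner_id: Optional[int] = None   # assumed: set when an employer owns it
    verified: bool = False


def is_removable(f: Facility) -> bool:
    """Hard guard: only unmanaged, unverified, crawl-generated rows."""
    return f.crawl_generated and f.owner_id is None and not f.verified


def plan_junk_folds(facilities: List[Facility], placeholder_id: int) -> Dict[int, int]:
    """Map each removable junk facility id to the placeholder id."""
    return {
        f.id: placeholder_id
        for f in facilities
        if f.id != placeholder_id and is_removable(f) and is_junk_name(f.name)
    }
```

Returning a plan (old id to placeholder id) instead of deleting directly leaves room to repoint shifts/jobs before any row is removed, in the order the commit message describes.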

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 05:40:29 +03:30
soroush.asadi e6a796ab27 Match crawled listings to existing facilities (fuzzy) before creating new
CI/CD / CI · dotnet build (push) Successful in 1m28s
CI/CD / Deploy · hamkadr (push) Successful in 2m24s
When publishing a scraped listing we now look for an existing facility
that matches it exactly or closely, and only create a new one when there
is no match. This avoids duplicates like «بیمارستان میلاد» vs «میلاد».

- ListingParser: extract a facility name (keyword + distinctive words) from
  the post and surface it in the parser notes.
- FacilityMatcher: Persian-aware normalization (Arabic ي/ك variants, ZWNJ,
  punctuation), type-word stripping for a "core" name, contains + Levenshtein
  similarity, and FindBest (same-city exact → any-city exact → same-city
  fuzzy → any-city fuzzy).
- Review (manual publish): auto-select a matching facility or prefill the
  new-facility name; resolve-or-create uses fuzzy match; dropdown preselects.
- IngestionService (auto-publish): reuse FacilityMatcher against a run-wide
  facility list (grows as new ones are created) instead of exact-name only.
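
The matching pipeline listed above can be sketched as follows. This is a Python illustration under stated assumptions, not the project's implementation: the type-word list, the 0.8 threshold, the dict-shaped facilities and the rule that containment scores 1.0 are all hypothetical choices.

```python
import re


def normalize(name: str) -> str:
    """Fold Arabic yeh/kaf to their Persian forms, drop ZWNJ and punctuation."""
    name = name.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    name = name.replace("\u200c", " ")
    name = re.sub(r"[^\w\s]", " ", name)
    return " ".join(name.split())


# Hypothetical type words and threshold; the real values are not in the log.
TYPE_WORDS = {normalize(w) for w in ("بیمارستان", "درمانگاه", "کلینیک")}
FUZZY_THRESHOLD = 0.8


def core_name(name: str) -> str:
    return " ".join(t for t in normalize(name).split() if t not in TYPE_WORDS)


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    if a in b or b in a:  # assumed: containment counts as a full match
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def find_best(name, city, facilities):
    """Tiers: same-city exact, any-city exact, same-city fuzzy, any-city fuzzy."""
    core = core_name(name)
    if not core:
        return None
    scored = [(similarity(core, core_name(f["name"])), f) for f in facilities]
    same_city = [(s, f) for s, f in scored if f["city"] == city]
    for pool in (same_city, scored):          # exact tiers
        for _, f in pool:
            if core_name(f["name"]) == core:
                return f
    for pool in (same_city, scored):          # fuzzy tiers
        best = max(pool, key=lambda p: p[0], default=None)
        if best is not None and best[0] >= FUZZY_THRESHOLD:
            return best[1]
    return None
```

With a facility list containing «بیمارستان میلاد», a listing that names only «میلاد» in the same city resolves to the existing row in the first tier, so no duplicate would be created.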

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 07:14:48 +03:30