Facility data hygiene: merge duplicates, drop junk-named facilities
CI/CD / CI · dotnet build (push) Successful in 1m51s
CI/CD / Deploy · hamkadr (push) Successful in 2m17s

Cleans up the crawl-generated facility table that surfaced garbage on /Facilities
(«بیمارستان هستم», «... از مدجابز», bare «کلینیک», «سازمان برنامه جنوبی» x3):

- FacilityMatcher.IsJunkName: shared detector for non-names — bare type words, cores
  made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added
  داروخانه/آسایشگاه to the generic type words so bare ones are caught and dedupe better.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so
  new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities
  into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing
  their shifts/jobs first. Hard guard: only purely crawl-generated, unmanaged facilities
  are removed — employer-owned and verified facilities are never touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
soroush.asadi
2026-06-21 05:40:29 +03:30
parent 8be275596b
commit 88eca92333
5 changed files with 137 additions and 2 deletions
@@ -218,12 +218,17 @@ public class HeuristicListingParser : IListingParser
{
if (NameStops.Contains(w)) break;
if (Regex.IsMatch(w, @"\d")) break; // numbers/phones aren't names
if (!w.Any(char.IsLetter)) break; // emoji / punctuation («📍») isn't a name
if (w.Length == 1) break; // stray letters
picked.Add(w);
if (picked.Count >= 3) break;
}
if (picked.Count == 0) continue; // bare keyword (e.g. just «بیمارستان») isn't useful
return (kw + " " + string.Join(" ", picked)).Trim();
var candidate = (kw + " " + string.Join(" ", picked)).Trim();
// Reject names that are only filler/verb/source noise («بیمارستان هستم», «... از مدجابز») —
// a real name couldn't be extracted, so fall back to the shared placeholder downstream.
if (Scraping.FacilityMatcher.IsJunkName(candidate)) continue;
return candidate;
}
return null;
}