Facility data hygiene: merge duplicates, drop junk-named facilities
Cleans up the crawl-generated facility table, which surfaced garbage on /Facilities: «بیمارستان هستم» ("I am a hospital"), «... از مدجابز» ("... from MedJobs"), bare «کلینیک» ("clinic"), and «سازمان برنامه جنوبی» three times.

- FacilityMatcher.IsJunkName: shared detector for non-names, i.e. bare type words, cores made only of filler/verb tokens, and leaked crawl-source/placeholder text. Added داروخانه (pharmacy) and آسایشگاه (care home) to the generic type words so bare occurrences are caught and duplicates match more reliably.
- HeuristicListingParser.ExtractFacilityName now rejects junk candidates (and emoji), so new ingests fall back to the shared placeholder instead of forging a fake facility.
- IngestionService.MergeAndCleanFacilitiesAsync (+ admin button): folds junk facilities into the placeholder and merges Persian-fuzzy duplicates into one keeper, repointing their shifts/jobs first.

Hard guard: only purely crawl-generated, unmanaged facilities are removed; employer-owned and verified facilities are never touched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -218,12 +218,17 @@ public class HeuristicListingParser : IListingParser
             {
                 if (NameStops.Contains(w)) break;
                 if (Regex.IsMatch(w, @"\d")) break; // numbers/phones aren't names
+                if (!w.Any(char.IsLetter)) break; // emoji / punctuation («📍») isn't a name
                 if (w.Length == 1) break; // stray letters
                 picked.Add(w);
                 if (picked.Count >= 3) break;
             }
             if (picked.Count == 0) continue; // bare keyword (e.g. just «بیمارستان») isn't useful
-            return (kw + " " + string.Join(" ", picked)).Trim();
+            var candidate = (kw + " " + string.Join(" ", picked)).Trim();
+            // Reject names that are only filler/verb/source noise («بیمارستان هستم», «... از مدجابز»):
+            // a real name couldn't be extracted, so fall back to the shared placeholder downstream.
+            if (Scraping.FacilityMatcher.IsJunkName(candidate)) continue;
+            return candidate;
         }
         return null;
     }
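The commit message describes FacilityMatcher.IsJunkName only in prose, and the diff shows just its call site. As a rough illustration of the three checks it names (bare type words, cores made only of filler/verb tokens, leaked crawl-source text), a minimal sketch might look like the following. The class name JunkNameSketch, the word lists, and the whitespace tokenization are all assumptions made for illustration; the actual FacilityMatcher implementation is not shown here and may differ, for example by normalizing Persian/Arabic character variants first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch only; NOT the FacilityMatcher shipped in this commit.
public static class JunkNameSketch
{
    // Generic facility type words named in the commit message.
    static readonly HashSet<string> TypeWords = new HashSet<string>(StringComparer.Ordinal)
        { "بیمارستان", "کلینیک", "داروخانه", "آسایشگاه" };

    // Assumed filler/verb tokens that cannot form a name on their own.
    static readonly HashSet<string> FillerWords = new HashSet<string>(StringComparer.Ordinal)
        { "هستم", "از", "در", "و" };

    // Assumed markers of leaked crawl-source text.
    static readonly HashSet<string> SourceMarkers = new HashSet<string>(StringComparer.Ordinal)
        { "مدجابز" };

    public static bool IsJunkName(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return true;

        // Split on spaces/tabs and drop empty entries.
        var tokens = name.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        // Leaked crawl-source text anywhere in the candidate makes it junk.
        if (tokens.Any(t => SourceMarkers.Contains(t))) return true;

        // The "core" is whatever remains once generic type words are removed.
        var core = tokens.Where(t => !TypeWords.Contains(t)).ToList();

        // Bare type word: nothing is left to identify the facility.
        if (core.Count == 0) return true;

        // Core made only of filler/verb tokens.
        return core.All(t => FillerWords.Contains(t));
    }
}
```

Under these assumed word lists, «بیمارستان هستم» and a bare «کلینیک» are flagged as junk, while a candidate whose core contains at least one token outside the lists (for instance a proper name after the type word) is kept.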