Divar listings showed only a one-line summary («پرستار کودک ۳ روز … — پرداخت توافقی — … در
شادمان») because the scraper stored the search-result row text and only pulled phone + coords from
the post detail. Now FetchDetailAsync also extracts the full ad body (the longest free-text string
in the detail JSON, skipping Divar safety boilerplate that mentions «دیوار») and appends it, so the
listing carries the rich description users see on Divar. Applies to new crawls; existing rows keep
their short text until re-ingested.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Harvest now keeps each post's token, so we build a real post URL
(divar.ir/v/{token}) instead of a generic link.
- For each post we fetch the detail JSON (posts-v2/web/{token}) and
harvest any contact number from it — covering the very common case
where the poster writes the phone into the ad description. Divar's
click-to-reveal is login-gated, so this gets the in-text numbers
without auth; fails soft (blocking/errors → skip).
- HarvestPhones hardened with digit-boundary guards so it can't grab a
slice of a longer numeric id/timestamp inside JSON.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Divar's /v8/web-search GET returns a BLOCKING_VIEW (anti-bot), so the old source pulled nothing useful and could scrape the block message. Switch to the working POST /v8/postlist/w/search with a browser User-Agent and a city-id map (numeric id passthrough; tehran=1 default). Skip responses that are non-2xx or contain BLOCKING_VIEW so the block page is never ingested. Verified locally: fetched 25 real Tehran job posts into the review queue, 0 block messages.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each ingestion source now decides independently whether to route through the proxy: added TelegramUseProxy/BaleUseProxy/DivarUseProxy/MedjobsUseProxy/WebsitesUseProxy flags (migration). ScrapeHttpClients.For(s, useProxy) takes the source's own flag; a source is proxied only when its flag is on AND a proxy URL is set. Settings 'sources' tab: removed the global enable checkbox, kept the proxy address field, and added an «از پروکسی استفاده شود» checkbox under each source. Old IngestProxyEnabled column kept for compatibility but no longer gates routing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Telegram and some sources are filtered in Iran. .NET cannot speak vmess/vless/trojan, so add an Xray sidecar (compose service 'xray', behind the 'proxy' profile) that converts the admin's config into a local SOCKS5 proxy (xray:10808). New ScrapeHttpClients provider builds a proxied or direct HttpClient (WebProxy supports socks5/socks4/http) cached per proxy URL; all five ingestion sources (Telegram/Bale/Divar/Medjobs/Websites) now use it. Admin settings gain IngestProxyEnabled + IngestProxyUrl (migration; UI under sources). Added deploy/xray/config.json template + README with vmess/vless/trojan examples.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- AppSetting gains source config: AutoIngestEnabled, IngestIntervalMinutes, Telegram/Bale/Divar enabled+channels/token/queries
- IListingSource.FetchAsync(AppSetting) — sources read config from DB, not IOptions/appsettings; sample source dev-only
- IngestionWorker reads AutoIngest+interval from DB each cycle (toggle at runtime, no redeploy)
- /Admin/Settings gets a 'منابع جمعآوری' section; removed Ingestion env/appsettings + compose env vars
- ENV_FILE shrinks to HOST_PORT + POSTGRES_* + ADMIN_PHONE (AI + sources are all in-admin); migration
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>