Commit Graph

9 Commits

Author SHA1 Message Date
soroush.asadi c778b87e79 Capture the full Divar ad description, not just the search-row summary
CI/CD / CI · dotnet build (push) Successful in 1m28s
CI/CD / Deploy · hamkadr (push) Successful in 2m22s
Divar listings showed only a one-line summary («پرستار کودک ۳ روز … — پرداخت توافقی — … در
شادمان») because the scraper stored the search-result row text and only pulled phone + coords from
the post detail. Now FetchDetailAsync also extracts the full ad body (the longest free-text string
in the detail JSON, skipping Divar safety boilerplate that mentions «دیوار») and appends it, so the
listing carries the rich description users see on Divar. Applies to new crawls; existing rows keep
their short text until re-ingested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 19:04:30 +03:30
soroush.asadi 380243b669 Divar geo-coords to facility map + medical gate + RawListing FK/geo migrations
CI/CD / CI · dotnet build (push) Successful in 2m6s
CI/CD / Deploy · hamkadr (push) Successful in 2m3s
2026-06-09 21:38:55 +03:30
soroush.asadi a5d6e212e2 Divar: capture post token + harvest phone from full ad detail
CI/CD / CI · dotnet build (push) Successful in 2m4s
CI/CD / Deploy · hamkadr (push) Successful in 2m18s
- Harvest now keeps each post's token, so we build a real post URL
  (divar.ir/v/{token}) instead of a generic link.
- For each post we fetch the detail JSON (posts-v2/web/{token}) and
  harvest any contact number from it — covering the very common case
  where the poster writes the phone into the ad description. Divar's
  click-to-reveal is login-gated, so this gets the in-text numbers
  without auth; fails soft (blocking/errors → skip).
- HarvestPhones hardened with digit-boundary guards so it can't grab a
  slice of a longer numeric id/timestamp inside JSON.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 08:28:37 +03:30
soroush.asadi 2485173aad [Ingest] Fix Divar: use POST search API (GET was anti-bot blocked)
CI/CD / CI · dotnet build (push) Successful in 1m51s
CI/CD / Deploy · hamkadr (push) Successful in 2m8s
Divar's /v8/web-search GET returns a BLOCKING_VIEW (anti-bot), so the old source pulled nothing useful and could scrape the block message. Switch to the working POST /v8/postlist/w/search with a browser User-Agent and a city-id map (numeric id passthrough; tehran=1 default). Skip responses that are non-2xx or contain BLOCKING_VIEW so the block page is never ingested. Verified locally: fetched 25 real Tehran job posts into the review queue, 0 block messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 21:23:36 +03:30
soroush.asadi b1e474ba33 [Ingest] Per-source proxy toggle instead of one global switch
CI/CD / CI · dotnet build (push) Successful in 56s
CI/CD / Deploy · hamkadr (push) Successful in 1m6s
Each ingestion source now decides independently whether to route through the proxy: added TelegramUseProxy/BaleUseProxy/DivarUseProxy/MedjobsUseProxy/WebsitesUseProxy flags (migration). ScrapeHttpClients.For(s, useProxy) takes the source's own flag; a source is proxied only when its flag is on AND a proxy URL is set. Settings 'sources' tab: removed the global enable checkbox, kept the proxy address field, and added an «از پروکسی استفاده شود» checkbox under each source. Old IngestProxyEnabled column kept for compatibility but no longer gates routing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 18:46:48 +03:30
soroush.asadi cea27c8684 [Ingest] Route scraping through an optional V2Ray/Xray proxy (Telegram in Iran)
CI/CD / CI · dotnet build (push) Successful in 53s
CI/CD / Deploy · hamkadr (push) Successful in 1m12s
Telegram and some sources are filtered in Iran. .NET cannot speak vmess/vless/trojan, so add an Xray sidecar (compose service 'xray', behind the 'proxy' profile) that converts the admin's config into a local SOCKS5 proxy (xray:10808). New ScrapeHttpClients provider builds a proxied or direct HttpClient (WebProxy supports socks5/socks4/http) cached per proxy URL; all five ingestion sources (Telegram/Bale/Divar/Medjobs/Websites) now use it. Admin settings gain IngestProxyEnabled + IngestProxyUrl (migration; UI under sources). Added deploy/xray/config.json template + README with vmess/vless/trojan examples.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 17:53:17 +03:30
soroush.asadi 3c08c1a265 Move ingestion + Telegram/Bale/Divar config to DB-backed admin settings
CI/CD / CI · dotnet build (push) Successful in 6m22s
CI/CD / Deploy · hamkadr (push) Failing after 3s
- AppSetting gains source config: AutoIngestEnabled, IngestIntervalMinutes, Telegram/Bale/Divar enabled+channels/token/queries
- IListingSource.FetchAsync(AppSetting) — sources read config from DB, not IOptions/appsettings; sample source dev-only
- IngestionWorker reads AutoIngest+interval from DB each cycle (toggle at runtime, no redeploy)
- /Admin/Settings gets a 'منابع جمع‌آوری' section; removed Ingestion env/appsettings + compose env vars
- ENV_FILE shrinks to HOST_PORT + POSTGRES_* + ADMIN_PHONE (AI + sources are all in-admin); migration

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 00:44:11 +03:30
soroush.asadi 36bb165438 Real channel fetch (Telegram/Bale/Divar) + AI-audited automation engine + CI/CD
- Fetch: Telegram via t.me/s, Bale via Bot API, Divar via web-search (HttpClient, config-gated, graceful)
- AI layer: DB-backed AppSetting (mode auto/manual, thresholds, AI endpoint/model/key/prompt/framework, auto-approve); OpenAI-compatible IAiAuditor (self-host/Iranian endpoints; fails safe to manual)
- Pipeline: fetch → dedupe(hash) → parse → validate → AI audit → Discard/Flag/Queue/auto-publish (resolve-or-create facility)
- Admin: /Admin/Settings automation+AI panel; queue shows confidence + AI verdict; flagged section
- CI/CD: Dockerfile, docker-compose.prod.yml, .gitea/workflows/ci-cd.yml, nginx vhost, DEPLOY.md; forwarded headers + /healthz + prod reference-only seed; ports 22/80/443 only

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 17:41:02 +03:30
soroush.asadi 931b7b6ffb Add scrape/ingestion engine + validation, and 24h shift hour-range visualization
Scrape engine (Services/Scraping/): pluggable IListingSource (working sample + Telegram/Divar credential-ready stubs) → IngestionService (content-hash dedupe → parse → validate → review queue) → ListingValidator (completeness score + spam screen) → IngestionWorker (config-gated hosted service). RawListing gains ContentHash/Confidence/ValidationNotes; RawListingStatus.Flagged. Admin /Admin gets run-now, source list, confidence + flagged queue.

Hour-range viz: _HourBar 24h timeline bar (colored by type, overnight wrap) on shift cards, recommendation cards, and detail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-03 08:18:19 +03:30