# Multi-Backend Scraper Architecture Plan

## Overview
This document outlines the plan to refactor the current Facebook Marketplace scraper to support multiple backends (Facebook Marketplace, Zillow, and potentially others) while maintaining code reusability and clean separation of concerns.

## Current Architecture Analysis

### Facebook-Specific Components
- **`FacebookAuth`** (`src/auth.py`) - Handles Facebook login, cookie management, and session validation
- **`MarketplaceScraper`** (`src/scraper.py`) - Contains Facebook-specific:
  - URL construction (`MARKETPLACE_SEARCH_URL`)
  - CSS selectors (`SELECTORS` dict in config)
  - HTML parsing patterns (React minified HTML)
  - Pagination logic (infinite scroll)
  - Listing ID extraction (`extract_id_from_url()`)
- **`config.py`** - Facebook URLs, selectors, credentials
- **Authentication flow** - Facebook login page structure

### Generic/Reusable Components
- **`Listing` model** (`src/models.py`) - Data structure (reusable)
- **`ListingCache`** (`src/cache.py`) - Generic HTML caching (reusable)
- **ODS file writing** - Generic export functionality (reusable)
- **Geocoding/distance calculation** (`src/geotest.py`) - Generic (reusable)
- **Browser management** (Playwright) - Generic (reusable)

## Proposed Architecture

### 1. Abstract Base Classes

#### `src/base_scraper.py`
Create abstract base class defining the scraper interface:

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Dict
from .models import Listing

class BaseScraper(ABC):
    """Abstract base class for all scrapers"""

    @abstractmethod
    def authenticate(self, verify_url: Optional[str] = None) -> bool:
        """Authenticate with the backend service"""
        pass

    @abstractmethod
    def search(self, query_or_url: str) -> None:
        """Navigate to search page"""
        pass

    @abstractmethod
    def wait_for_results(self) -> bool:
        """Wait for search results to load"""
        pass

    @abstractmethod
    def expand_pagination(self, target_count: int) -> Dict[str, str]:
        """Expand pagination to load more listings. Returns dict of listing_id -> url"""
        pass

    @abstractmethod
    def extract_listings_from_search_page(self, cached_listings: Optional[Dict[str, str]] = None) -> List[dict]:
        """Extract listing URLs from search page. Returns list of dicts with id, url, title, price"""
        pass

    @abstractmethod
    def scrape_listing_details(self, listing_url: str, cached_html: Optional[str] = None) -> Optional[Listing]:
        """Extract detailed information from a listing page"""
        pass

    @abstractmethod
    def _is_property_rentals_page(self) -> bool:
        """Check if current page is a property rentals search page (if applicable)"""
        pass

    # Common methods (can be overridden)
    def start(self) -> None:
        """Start browser"""
        # Shared implementation

    def close(self) -> None:
        """Close browser"""
        # Shared implementation

    def run(self, query: str, cache_only: bool = False, delete_ods: bool = True, close_browser: bool = True) -> List[Listing]:
        """Main execution flow - shared implementation"""
        # Common flow: authenticate -> search -> paginate -> extract -> scrape
```
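
A minimal sketch of what the shared `run()` flow might look like, assuming the abstract methods above and a `config.MAX_LISTINGS` setting; `cache_only`/`delete_ods` handling and ODS export are omitted:

```python
# Sketch of the shared run() flow in src/base_scraper.py (requires `import config` there).
class BaseScraper(ABC):
    # ... abstract methods as defined above ...

    def run(self, query: str, cache_only: bool = False, delete_ods: bool = True,
            close_browser: bool = True) -> List[Listing]:
        """Shared flow: authenticate -> search -> paginate -> extract -> scrape."""
        self.start()
        try:
            if not self.authenticate():
                return []
            self.search(query)
            if not self.wait_for_results():
                return []
            # Load more results, then pull listing summaries from the search page
            cached = self.expand_pagination(target_count=config.MAX_LISTINGS)
            summaries = self.extract_listings_from_search_page(cached_listings=cached)
            # Visit each listing page for details (cache_only / delete_ods handling omitted here)
            results = []
            for summary in summaries:
                listing = self.scrape_listing_details(summary["url"])
                if listing is not None:
                    results.append(listing)
            return results
        finally:
            if close_browser:
                self.close()
```

The template-method structure keeps the ordering in one place while each backend supplies the individual steps.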

#### `src/base_auth.py`
Create abstract base class for authentication:

```python
from abc import ABC, abstractmethod
from typing import Optional
from playwright.sync_api import Page, BrowserContext

class BaseAuth(ABC):
    """Abstract base class for authentication handlers"""

    @abstractmethod
    def authenticate(self, page: Page, verify_url: Optional[str] = None) -> bool:
        """Authenticate and return True if successful"""
        pass

    @abstractmethod
    def is_logged_in(self, page: Page) -> bool:
        """Check if user is logged in"""
        pass

    @abstractmethod
    def load_cookies(self) -> Optional[list]:
        """Load cookies from file"""
        pass

    @abstractmethod
    def save_cookies(self, cookies: list) -> None:
        """Save cookies to file"""
        pass
```

### 2. Backend-Specific Implementations

#### `src/facebook/` Directory Structure
```
src/facebook/
├── __init__.py
├── scraper.py          # FacebookScraper extends BaseScraper
├── auth.py             # FacebookAuth extends BaseAuth
└── config.py           # Facebook-specific config
```

#### `src/zillow/` Directory Structure
```
src/zillow/
├── __init__.py
├── scraper.py          # ZillowScraper extends BaseScraper
├── auth.py             # ZillowAuth extends BaseAuth (may be minimal)
└── config.py           # Zillow-specific config
```

### 3. Configuration Refactoring

#### `config.py` - Updated Structure
```python
from pathlib import Path

# BASE_DIR is assumed to already exist in the current config.py; shown here so the paths below resolve
BASE_DIR = Path(__file__).resolve().parent

# Backend registry
BACKENDS = {
    'facebook': {
        'name': 'Facebook Marketplace',
        'module': 'src.facebook',
        'scraper_class': 'FacebookScraper',
        'auth_class': 'FacebookAuth',
        'config_module': 'src.facebook.config',
    },
    'zillow': {
        'name': 'Zillow',
        'module': 'src.zillow',
        'scraper_class': 'ZillowScraper',
        'auth_class': 'ZillowAuth',
        'config_module': 'src.zillow.config',
    }
}

# Default backend
DEFAULT_BACKEND = 'facebook'

# Shared settings (used by all backends)
MAX_LISTINGS = 100
HEADLESS = False
SCROLL_WAIT_TIME = 2000
PAGINATION_MAX_SCROLLS = 50
NO_CHANGE_LIMIT = 3

# Shared file paths
DATA_DIR = BASE_DIR / "data"
CACHE_DIR = DATA_DIR / "cache"
ODS_FILE = DATA_DIR / "listings.ods"
GEOCODED_ADDRESSES_FILE = DATA_DIR / "geocoded_addresses.json"
DISTANCE_CACHE_FILE = DATA_DIR / "distance_cache.json"
```

#### `src/facebook/config.py`
```python
# Facebook-specific configuration
from config import DATA_DIR  # root-level config.py; assumes the project root is on sys.path
FACEBOOK_BASE_URL = "https://www.facebook.com"
FACEBOOK_LOGIN_URL = "https://www.facebook.com/login"
MARKETPLACE_BASE_URL = "https://www.facebook.com/marketplace"
MARKETPLACE_SEARCH_URL = "https://www.facebook.com/marketplace/search/?query={query}"

SELECTORS = {
    "listing_card": "[data-testid='marketplace-listing']",
    "listing_link": "a[href*='/marketplace/item/']",
    "login_button": "#loginbutton, button[name='login']",
    "email_input": "#email",
    "password_input": "#pass",
    "logged_in_indicator": "[aria-label='Your profile'], [aria-label='Account']",
}

COOKIES_FILE = DATA_DIR / "facebook_cookies.json"
```

#### `src/zillow/config.py`
```python
# Zillow-specific configuration
from config import DATA_DIR  # root-level config.py; assumes the project root is on sys.path
ZILLOW_BASE_URL = "https://www.zillow.com"
ZILLOW_SEARCH_URL = "https://www.zillow.com/homes/{query}"

SELECTORS = {
    "listing_card": "[data-testid='property-card']",
    "listing_link": "a[href*='/homedetails/']",
    # Zillow may not require login for basic searches
    "login_button": None,
    "email_input": None,
    "password_input": None,
    "logged_in_indicator": None,
}

COOKIES_FILE = DATA_DIR / "zillow_cookies.json"
```

### 4. Refactoring Steps

#### Step 1: Create Base Classes
1. Create `src/base_scraper.py` with `BaseScraper` abstract class
2. Create `src/base_auth.py` with `BaseAuth` abstract class
3. Move common browser management code to base classes

#### Step 2: Extract Facebook Implementation
1. Create `src/facebook/` directory
2. Move `FacebookAuth` from `src/auth.py` to `src/facebook/auth.py`
   - Update to extend `BaseAuth`
   - Keep Facebook-specific logic
3. Refactor `MarketplaceScraper` to `FacebookScraper` in `src/facebook/scraper.py`
   - Extend `BaseScraper`
   - Move Facebook-specific methods
   - Keep shared methods in base class
4. Create `src/facebook/config.py` with Facebook-specific config
5. Create `src/facebook/__init__.py` to export classes
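
For step 5, the package initializer can simply re-export the public classes so callers (and the backend registry) can do `from src.facebook import FacebookScraper`:

```python
# src/facebook/__init__.py
from .auth import FacebookAuth
from .scraper import FacebookScraper

__all__ = ["FacebookAuth", "FacebookScraper"]
```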

#### Step 3: Implement Zillow Backend
1. Create `src/zillow/` directory
2. Create `src/zillow/auth.py` with `ZillowAuth` extending `BaseAuth`
   - May be minimal if Zillow doesn't require login
3. Create `src/zillow/scraper.py` with `ZillowScraper` extending `BaseScraper`
   - Implement Zillow-specific:
     - URL construction
     - Selectors
     - HTML parsing
     - Pagination (likely page-based, not infinite scroll)
     - Listing ID extraction
4. Create `src/zillow/config.py` with Zillow-specific config
5. Create `src/zillow/__init__.py` to export classes

#### Step 4: Update Main Entry Point
1. Update `main.py` to support backend selection:
   ```python
   parser.add_argument(
       '--backend',
       choices=['facebook', 'zillow'],
       default='facebook',
       help='Backend to use for scraping'
   )
   ```
2. Add backend factory function:
   ```python
   import importlib

   import config
   from src.base_scraper import BaseScraper

   def create_scraper(backend_name: str, **kwargs) -> BaseScraper:
       backend_config = config.BACKENDS[backend_name]
       module = importlib.import_module(backend_config['module'])
       scraper_class = getattr(module, backend_config['scraper_class'])
       return scraper_class(**kwargs)
   ```
3. Update main flow to use factory
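
Putting steps 1-3 together, the updated flow in `main.py` might look roughly like this (a sketch; `run()` follows the base-class signature above and `create_scraper()` is the factory from step 2):

```python
import argparse

import config


def main() -> None:
    parser = argparse.ArgumentParser(description="Multi-backend listing scraper")
    parser.add_argument('query', help='Search query or URL')
    parser.add_argument('--backend', choices=list(config.BACKENDS), default=config.DEFAULT_BACKEND,
                        help='Backend to use for scraping')
    args = parser.parse_args()

    scraper = create_scraper(args.backend)  # factory from step 2, defined alongside main()
    listings = scraper.run(args.query)
    print(f"Scraped {len(listings)} listings via {args.backend}")


if __name__ == "__main__":
    main()
```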

#### Step 5: Update Models
1. Review `Listing` model - may need to add Zillow-specific fields
2. Update `extract_id_from_url()` to be backend-aware or move to backend-specific modules (see the sketch after this list)
3. Update `normalize_url()` similarly

#### Step 6: Update Cache
1. Ensure `ListingCache` works with different URL formats
2. May need backend-specific cache directories:
   - `data/cache/facebook/`
   - `data/cache/zillow/`
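
A possible wiring, assuming `ListingCache` accepts (or is extended to accept) a cache directory argument:

```python
from pathlib import Path

import config
from src.cache import ListingCache


def create_cache(backend_name: str) -> ListingCache:
    """Give each backend its own subdirectory, e.g. data/cache/facebook/."""
    cache_dir: Path = config.CACHE_DIR / backend_name
    cache_dir.mkdir(parents=True, exist_ok=True)
    return ListingCache(cache_dir=cache_dir)  # assumed constructor parameter
```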

### 5. Implementation Details

#### Facebook Scraper Migration
- **Current**: `src/scraper.py` → `MarketplaceScraper`
- **New**: `src/facebook/scraper.py` → `FacebookScraper extends BaseScraper`
- **Changes**:
  - Move Facebook-specific selectors to `src/facebook/config.py`
  - Move Facebook URL construction to `FacebookScraper.search()`
  - Keep common methods (ODS writing, geocoding) in base class
  - Move `_is_property_rentals_page()` to `FacebookScraper`

#### Zillow Scraper Implementation
- **New**: `src/zillow/scraper.py` → `ZillowScraper extends BaseScraper`
- **Key Differences from Facebook**:
  - **Authentication**: May not require login (public listings)
  - **Pagination**: Likely page-based (`?page=2`) vs infinite scroll
  - **HTML Structure**: Different selectors, may be more structured
  - **URL Format**: `/homedetails/{zpid}/` vs `/marketplace/item/{id}/`
  - **Data Fields**: May have different/additional fields (ZPID, Zestimate, etc.)
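
A rough sketch of how two of these pieces might look in `ZillowScraper`; the `?page=N` parameter, the `self.page`/`self.search_url` attributes, and the ZPID regex are assumptions to verify against the live site:

```python
import re
from typing import Dict
from urllib.parse import quote

from ..base_scraper import BaseScraper
from . import config as zillow_config


class ZillowScraper(BaseScraper):
    # Only two of the required methods are shown here.

    def search(self, query_or_url: str) -> None:
        """Navigate to a Zillow search URL built from a query, or to a full URL as-is."""
        if query_or_url.startswith("http"):
            url = query_or_url
        else:
            url = zillow_config.ZILLOW_SEARCH_URL.format(query=quote(query_or_url))
        self.search_url = url   # remembered for pagination below (assumed attribute)
        self.page.goto(url)     # self.page: Playwright page managed by the base class

    def expand_pagination(self, target_count: int) -> Dict[str, str]:
        """Walk numbered result pages instead of infinite scrolling."""
        listings: Dict[str, str] = {}
        page_num = 1
        while len(listings) < target_count:
            before = len(listings)
            for link in self.page.query_selector_all(zillow_config.SELECTORS["listing_link"]):
                href = link.get_attribute("href") or ""
                match = re.search(r"/homedetails/.*?/(\d+)_zpid", href)
                if match:
                    listings[match.group(1)] = href
            if len(listings) == before:
                break  # no new results on this page; stop paging
            page_num += 1
            self.page.goto(f"{self.search_url}?page={page_num}")
        return listings
```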

#### Shared Utilities
- **ODS Writing**: Keep in base class or separate utility module
- **Geocoding**: Already generic in `src/geotest.py`
- **Cache Management**: Generic `ListingCache` should work for both

### 6. URL Handling

#### Backend Detection
Add function to detect backend from URL:
```python
from typing import Optional

def detect_backend_from_url(url: str) -> Optional[str]:
    """Detect backend from URL"""
    if 'facebook.com/marketplace' in url:
        return 'facebook'
    elif 'zillow.com' in url:
        return 'zillow'
    return None
```

#### URL Normalization
- Keep generic `normalize_url()` in `src/models.py` for basic normalization
- Backend-specific normalization in each scraper class
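
A minimal sketch of the generic half, assuming listing URLs only need their query string and fragment stripped:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Drop tracking query parameters and fragments; keep scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


# e.g. "https://www.facebook.com/marketplace/item/123/?ref=search"
#   -> "https://www.facebook.com/marketplace/item/123"
```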

### 7. Testing Strategy

#### Unit Tests
- Test base classes with mock implementations
- Test Facebook scraper independently
- Test Zillow scraper independently
- Test backend factory/selection
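
For instance, a throwaway concrete subclass can verify that the abstract interface is complete and instantiable (a sketch; assumes `BaseScraper()` can be constructed without arguments):

```python
from src.base_scraper import BaseScraper


class StubScraper(BaseScraper):
    """Minimal concrete backend used only in tests."""

    def authenticate(self, verify_url=None):
        return True

    def search(self, query_or_url):
        self.last_query = query_or_url

    def wait_for_results(self):
        return True

    def expand_pagination(self, target_count):
        return {"1": "https://example.com/listing/1"}

    def extract_listings_from_search_page(self, cached_listings=None):
        return [{"id": "1", "url": "https://example.com/listing/1", "title": "t", "price": "$1"}]

    def scrape_listing_details(self, listing_url, cached_html=None):
        return None

    def _is_property_rentals_page(self):
        return False


def test_stub_scraper_satisfies_interface():
    scraper = StubScraper()  # raises TypeError if an abstract method is missing
    assert scraper.authenticate()
    assert scraper.wait_for_results()
```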

#### Integration Tests
- Test full scraping flow for each backend
- Test multi-backend runs
- Test cache isolation between backends

### 8. Migration Path

#### Backward Compatibility
- Keep old `MarketplaceScraper` import working temporarily:
  ```python
  # src/scraper.py (deprecated, redirects)
  from .facebook.scraper import FacebookScraper as MarketplaceScraper
  ```
- Update `main.py` to default to Facebook backend
- Add deprecation warnings
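
The shim can emit the warning at import time, for example:

```python
# src/scraper.py (deprecated shim)
import warnings

from .facebook.scraper import FacebookScraper as MarketplaceScraper

warnings.warn(
    "src.scraper.MarketplaceScraper is deprecated; use src.facebook.FacebookScraper instead",
    DeprecationWarning,
    stacklevel=2,
)

__all__ = ["MarketplaceScraper"]
```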

#### Gradual Migration
1. Phase 1: Create base classes, refactor Facebook (keep old code working)
2. Phase 2: Implement Zillow backend
3. Phase 3: Update main.py to support both
4. Phase 4: Remove deprecated code

### 9. File Structure After Refactoring

```
marketplace_apt/
├── src/
│   ├── __init__.py
│   ├── base_scraper.py          # NEW: Abstract base class
│   ├── base_auth.py             # NEW: Abstract auth base class
│   ├── models.py                # Generic Listing model
│   ├── cache.py                 # Generic cache
│   ├── geotest.py               # Generic geocoding
│   ├── facebook/                # NEW: Facebook backend
│   │   ├── __init__.py
│   │   ├── scraper.py           # FacebookScraper
│   │   ├── auth.py              # FacebookAuth
│   │   └── config.py            # Facebook config
│   └── zillow/                  # NEW: Zillow backend
│       ├── __init__.py
│       ├── scraper.py           # ZillowScraper
│       ├── auth.py              # ZillowAuth
│       └── config.py            # Zillow config
├── data/
│   ├── cache/
│   │   ├── facebook/            # Facebook cached HTML
│   │   └── zillow/              # Zillow cached HTML
│   ├── facebook_cookies.json
│   ├── zillow_cookies.json
│   └── listings.ods             # Combined results
├── config.py                    # Shared config + backend registry
├── main.py                      # Updated entry point
└── requirements.txt
```

### 10. Key Design Decisions

#### 1. Backend Selection
- **Command-line argument**: `--backend facebook|zillow`
- **Auto-detection**: Detect from URL if possible
- **Default**: Facebook (for backward compatibility)

#### 2. Cache Isolation
- **Separate directories**: `data/cache/facebook/` and `data/cache/zillow/`
- **Shared ODS**: Combined results in single ODS file
- **Cache keys**: Backend-specific ID extraction

#### 3. Authentication
- **Facebook**: Requires login (cookies/credentials)
- **Zillow**: May not require login (public listings)
- **Base class**: Handles both cases gracefully

#### 4. Pagination Strategies
- **Facebook**: Infinite scroll (current implementation)
- **Zillow**: Likely page-based pagination
- **Base class**: Abstract method allows different implementations

#### 5. Data Model
- **Listing model**: Keep generic, add optional fields if needed
- **Backend-specific fields**: Store in `Listing` as optional attributes or extend model
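
One way to do this without forking the model per backend; field names beyond the obvious ones are illustrative, not the current `Listing` definition:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Listing:
    """Generic listing shared by all backends."""
    id: str
    url: str
    title: str
    price: Optional[str] = None
    address: Optional[str] = None
    backend: str = "facebook"                            # which scraper produced this row
    extra: Dict[str, str] = field(default_factory=dict)  # e.g. {"zpid": "...", "zestimate": "..."}
```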

### 11. Potential Challenges

#### Challenge 1: Different Pagination Mechanisms
- **Solution**: Abstract `expand_pagination()` method allows different implementations
- Facebook: Infinite scroll
- Zillow: Page navigation

#### Challenge 2: Different HTML Structures
- **Solution**: Backend-specific selectors and parsing logic
- Each scraper implements its own `extract_listings_from_search_page()`

#### Challenge 3: Authentication Differences
- **Solution**: Abstract auth base class with optional methods
- Zillow auth may be minimal/no-op if no login required

#### Challenge 4: Rate Limiting
- **Solution**: Backend-specific rate limiting in each scraper
- May need different delays/strategies per backend
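
For example, a per-backend base delay could sit alongside the other shared settings and be applied between page loads (the values below are placeholders):

```python
import random
import time

# Hypothetical per-backend base delays, in seconds
REQUEST_DELAY_SECONDS = {"facebook": 2.0, "zillow": 5.0}


def polite_sleep(backend_name: str) -> None:
    """Sleep for the backend's base delay plus jitter to avoid a fixed request cadence."""
    base = REQUEST_DELAY_SECONDS.get(backend_name, 2.0)
    time.sleep(base + random.uniform(0.0, 1.0))
```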

#### Challenge 5: URL Format Differences
- **Solution**: Backend-specific URL parsing and ID extraction
- Move `extract_id_from_url()` to backend-specific modules

### 12. Implementation Checklist

#### Phase 1: Foundation
- [ ] Create `src/base_scraper.py` with `BaseScraper` abstract class
- [ ] Create `src/base_auth.py` with `BaseAuth` abstract class
- [ ] Extract common browser management to base classes
- [ ] Create backend registry in `config.py`

#### Phase 2: Facebook Refactoring
- [ ] Create `src/facebook/` directory structure
- [ ] Move `FacebookAuth` to `src/facebook/auth.py`
- [ ] Refactor `MarketplaceScraper` to `FacebookScraper` in `src/facebook/scraper.py`
- [ ] Create `src/facebook/config.py`
- [ ] Update imports and test Facebook backend

#### Phase 3: Zillow Implementation
- [ ] Create `src/zillow/` directory structure
- [ ] Implement `ZillowAuth` in `src/zillow/auth.py`
- [ ] Implement `ZillowScraper` in `src/zillow/scraper.py`
- [ ] Create `src/zillow/config.py`
- [ ] Test Zillow backend

#### Phase 4: Integration
- [ ] Update `main.py` with backend selection
- [ ] Create backend factory function
- [ ] Update cache to support backend-specific directories
- [ ] Test multi-backend runs
- [ ] Update documentation

#### Phase 5: Cleanup
- [ ] Remove deprecated code
- [ ] Update all imports
- [ ] Add deprecation warnings for old imports
- [ ] Final testing

### 13. Estimated Effort

- **Base classes**: 2-3 hours
- **Facebook refactoring**: 4-6 hours
- **Zillow implementation**: 8-12 hours (depends on Zillow's complexity)
- **Integration**: 2-3 hours
- **Testing**: 3-4 hours
- **Total**: ~20-30 hours

### 14. Benefits

1. **Code Reusability**: Common functionality shared across backends
2. **Maintainability**: Each backend isolated and easier to maintain
3. **Extensibility**: Easy to add new backends (Craigslist, Apartments.com, etc.)
4. **Flexibility**: Can run multiple backends in same session
5. **Testing**: Each backend can be tested independently

### 15. Risks and Mitigation

#### Risk 1: Breaking Changes
- **Mitigation**: Maintain backward compatibility during migration
- Keep old imports working with deprecation warnings

#### Risk 2: Zillow Anti-Bot Measures
- **Mitigation**: Research Zillow's bot detection before implementation
- May need proxies, different user agents, or slower scraping

#### Risk 3: Different Data Models
- **Mitigation**: Keep `Listing` model flexible with optional fields
- Consider extending model if needed

#### Risk 4: Performance Impact
- **Mitigation**: Backend-specific optimizations
- Cache isolation prevents conflicts

## Conclusion

This refactoring will transform the codebase from a Facebook-specific scraper to a flexible multi-backend scraping framework. The architecture allows for easy addition of new backends while maintaining code quality and reusability. The estimated effort is moderate (~20-30 hours) but will provide significant long-term benefits in terms of maintainability and extensibility.