AI Skill Report Card
E-commerce Site Intelligence Scraping
Quick Start
```python
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def scrape_site_intelligence(domain):
    """Complete e-commerce site analysis"""
    results = {
        'basic_info': get_basic_info(domain),
        'traffic_data': get_traffic_data(domain),
        'pricing_strategy': analyze_pricing(domain),
        'product_catalog': extract_product_data(domain)
    }
    return results

# Example usage
data = scrape_site_intelligence("example-store.com")
```
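The quick start aborts entirely if any single step raises. One way to keep partial results is to run each step independently; `scrape_safely` below is a hypothetical wrapper, not part of the functions defined later:

```python
def scrape_safely(domain, steps):
    """Run each analysis step independently so one failure doesn't
    abort the whole report. `steps` maps section names to callables."""
    results = {}
    for name, step in steps.items():
        try:
            results[name] = step(domain)
        except Exception as exc:  # broad by design: any step may fail
            results[name] = {'error': str(exc)}
    return results
```

Usage would look like `scrape_safely("example-store.com", {'basic_info': get_basic_info, 'traffic_data': get_traffic_data})`, with failed sections recorded as `{'error': ...}` instead of crashing the run.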
Recommendation: Add concrete input/output examples for each function showing actual scraped data from real sites.
Workflow
- Extract basic site information
- Fetch traffic analytics via APIs
- Analyze pricing strategies
- Map product catalog structure
- Compile competitive intelligence report
Step 1: Basic Information Extraction
```python
def get_basic_info(domain):
    url = f"https://{domain}"
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    def meta_content(name):
        # .get() avoids a KeyError when the tag exists but has no content attribute
        tag = soup.find('meta', {'name': name})
        return tag.get('content', '') if tag else ''

    icon = soup.find('link', rel='icon')
    return {
        'domain': domain,
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'description': meta_content('description'),
        'keywords': meta_content('keywords'),
        'favicon': urljoin(url, icon['href']) if icon and icon.has_attr('href') else ''
    }
```
Step 2: Traffic Data Collection
```python
def get_traffic_data(domain):
    # SimilarWeb API example (endpoint paths and response shapes may change;
    # check the current API documentation)
    api_key = "your_similarweb_key"
    endpoints = {
        'overview': f"https://api.similarweb.com/v1/website/{domain}/total-traffic-and-engagement/visits",
        'traffic_sources': f"https://api.similarweb.com/v1/website/{domain}/traffic-sources/overview-share",
        'geography': f"https://api.similarweb.com/v1/website/{domain}/geo/traffic-shares/countries"
    }

    traffic_data = {}
    for key, url in endpoints.items():
        response = requests.get(
            url,
            params={'api_key': api_key, 'start_date': '2024-01', 'end_date': '2024-03'},
            timeout=10
        )
        response.raise_for_status()
        traffic_data[key] = response.json()

    overview = traffic_data['overview']['visits'][0]
    return {
        'monthly_visits': overview['visits'],
        'bounce_rate': overview['bounce_rate'],
        'avg_duration': overview['average_visit_duration'],
        'traffic_sources': traffic_data['traffic_sources'],
        'top_countries': traffic_data['geography']['records'][:5]
    }
```
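The nested indexing above (`['visits'][0]['visits']` etc.) assumes a fixed response shape and raises `KeyError` the moment the API changes it. A small defensive accessor keeps partial results usable; `dig` is a hypothetical helper, not part of the SimilarWeb API:

```python
def dig(data, *keys, default=None):
    """Walk nested dicts/lists, returning `default` on any missing step."""
    for key in keys:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return default
    return data

# e.g. monthly_visits = dig(traffic_data, 'overview', 'visits', 0, 'visits', default=0)
```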
Step 3: Pricing Analysis
```python
import re

def analyze_pricing(domain):
    url = f"https://{domain}"
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Common price selectors (these overlap, so some prices may be counted twice)
    price_selectors = [
        '.price', '.product-price', '[class*="price"]',
        '.cost', '.amount', '[data-price]'
    ]

    prices = []
    for selector in price_selectors:
        for elem in soup.select(selector):
            price_text = elem.get_text().strip()
            # Extract numeric values
            price_match = re.search(r'[\d,]+\.?\d*', price_text)
            if price_match:
                prices.append(float(price_match.group().replace(',', '')))

    return {
        'price_range': {'min': min(prices), 'max': max(prices)} if prices else None,
        'average_price': sum(prices) / len(prices) if prices else None,
        'currency': extract_currency(soup),
        'discount_indicators': len(soup.select('[class*="sale"], [class*="discount"], [class*="off"]'))
    }
```
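`analyze_pricing` calls `extract_currency`, which is never defined in this guide. A minimal heuristic sketch follows; the symbol-to-code mapping is an assumption, and real pages can mix currencies:

```python
import re

CURRENCY_SYMBOLS = {'$': 'USD', '€': 'EUR', '£': 'GBP', '¥': 'JPY'}

def extract_currency(soup):
    """Guess the dominant currency from page text. Accepts a BeautifulSoup
    object or a plain string; returns an ISO 4217 code or None."""
    text = soup.get_text() if hasattr(soup, 'get_text') else str(soup)
    # Symbols first, then bare currency codes
    for symbol, code in CURRENCY_SYMBOLS.items():
        if symbol in text:
            return code
    match = re.search(r'\b(USD|EUR|GBP|JPY|CAD|AUD)\b', text)
    return match.group(1) if match else None
```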
Step 4: Product Catalog Mapping
```python
def extract_product_data(domain):
    sitemap_url = f"https://{domain}/sitemap.xml"
    try:
        response = requests.get(sitemap_url, timeout=10)
        response.raise_for_status()
        # Parse sitemap for product URLs (the 'xml' parser requires lxml)
        soup = BeautifulSoup(response.content, 'xml')
        product_urls = [
            loc.text for loc in soup.find_all('loc')
            if '/product' in loc.text or '/item' in loc.text
        ]
    except requests.RequestException:
        # Fallback: crawl category pages
        product_urls = discover_product_urls(domain)

    categories = extract_categories(domain)

    return {
        'total_products': len(product_urls),
        'categories': categories,
        'product_sample': product_urls[:10]
    }

def extract_categories(domain):
    url = f"https://{domain}"
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Common navigation selectors
    nav_selectors = ['nav', '.navigation', '.menu', '.categories']
    categories = []
    for selector in nav_selectors:
        nav_elem = soup.select_one(selector)
        if nav_elem:
            links = nav_elem.find_all('a', href=True)
            categories.extend(link.text.strip() for link in links if link.text.strip())
    return list(set(categories))
```
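The sitemap fallback `discover_product_urls` is also undefined above. A sketch under stated assumptions: the listing paths are guesses rather than a standard, the regex heuristic only sees server-rendered markup (no JavaScript), and the `fetch` callable is injectable so the logic can be tested without network access:

```python
import re
from urllib.parse import urljoin

PRODUCT_PATH = re.compile(r'/(product|item)', re.I)

def extract_product_links(html, base_url):
    """Pull product-looking hrefs out of raw HTML with a regex heuristic."""
    hrefs = re.findall(r'href=["\']([^"\']+)["\']', html)
    return sorted({urljoin(base_url, h) for h in hrefs if PRODUCT_PATH.search(h)})

def discover_product_urls(domain, fetch=None, paths=('/', '/collections/all', '/shop')):
    """Crawl a few likely listing pages for product URLs."""
    if fetch is None:
        import requests  # deferred so the parsing helpers work without requests
        fetch = lambda url: requests.get(
            url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).text
    found = set()
    for path in paths:
        try:
            html = fetch(f"https://{domain}{path}")
        except Exception:
            continue  # skip listing pages that fail to load
        found.update(extract_product_links(html, f"https://{domain}"))
    return sorted(found)
```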
Recommendation: Include error handling code in the Quick Start example; the current code will fail on many real sites.
Examples
Example 1: Basic Site Analysis
Input: scrape_site_intelligence("shopify-store.com")
Output:
```json
{
  "basic_info": {
    "domain": "shopify-store.com",
    "title": "Premium Fashion Store - Latest Trends",
    "description": "Discover premium fashion with free shipping worldwide",
    "keywords": "fashion, clothing, premium, trends"
  },
  "traffic_data": {
    "monthly_visits": 125000,
    "bounce_rate": 0.45,
    "avg_duration": 185,
    "top_countries": ["US", "UK", "CA", "AU", "DE"]
  }
}
```
Example 2: Pricing Strategy Analysis
Input: Category page analysis
Output:
```json
{
  "price_range": {"min": 29.99, "max": 299.99},
  "average_price": 89.50,
  "currency": "USD",
  "discount_indicators": 15
}
```
Recommendation: Add a complete working example that demonstrates scraping a specific site like Amazon or eBay with expected outputs.
Best Practices
- Respect robots.txt: Check /robots.txt before scraping
- Rate limiting: Add delays between requests (1-2 seconds minimum)
- User agents: Rotate realistic browser user agents
- API integration: Use SimilarWeb, Ahrefs, or SEMrush APIs for traffic data
- Error handling: Implement retry logic for failed requests
- Data validation: Verify extracted prices and metrics
- Legal compliance: Ensure scraping complies with website terms of service
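The first two practices above can be sketched with the standard library; the `allowed_by_robots` and `RateLimiter` names are hypothetical helpers, and the user-agent string is a placeholder:

```python
import time
import urllib.robotparser

def allowed_by_robots(robots_txt, url, agent='my-scraper'):
    """Check an already-fetched robots.txt body against a URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval=1.5):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each `requests.get` keeps the crawl at or below one request per `min_interval` seconds.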
Common Pitfalls
- Dynamic content: Many e-commerce sites render content with JavaScript that plain HTTP requests never see. Consider Selenium for SPA sites
- Anti-bot measures: Back off when you hit CAPTCHAs or rate limiting instead of retrying aggressively
- Outdated selectors: Website layouts change. Build flexible selectors
- Traffic API costs: Third-party APIs can be expensive. Budget accordingly
- Currency confusion: Always identify and convert currencies consistently
- Incomplete data: Not all sites expose full product catalogs easily
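To avoid the currency-confusion pitfall, parse the amount and the currency together instead of stripping symbols first. A sketch that also infers the decimal separator (a heuristic covering only a few symbols and the common `1,299.00` / `1.299,00` conventions):

```python
import re

SYMBOL_TO_CODE = {'$': 'USD', '€': 'EUR', '£': 'GBP'}

def normalize_amount(raw):
    """Turn '1,299.00' or '1.299,00' into a float by inferring the decimal mark."""
    if ',' in raw and '.' in raw:
        dec = ',' if raw.rfind(',') > raw.rfind('.') else '.'
    elif ',' in raw:
        dec = ',' if len(raw.split(',')[-1]) == 2 else None
    elif '.' in raw:
        dec = '.' if len(raw.split('.')[-1]) == 2 else None
    else:
        dec = None
    digits = re.sub(r'\D', '', raw)
    if dec:
        frac = len(raw) - raw.rfind(dec) - 1
        return float(digits) / (10 ** frac)
    return float(digits)

def parse_price(text):
    """Extract (amount, currency_code) from a price string, or None."""
    m = re.search(r'([$€£])\s*([\d.,]+)|([\d.,]+)\s*([$€£])', text)
    if not m:
        return None
    symbol = m.group(1) or m.group(4)
    raw = m.group(2) or m.group(3)
    return normalize_amount(raw), SYMBOL_TO_CODE[symbol]
```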