AI Skill Report Card
Web Scraping and Analysis
Quick Start (15 / 15)
Python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Basic web scraping
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract specific elements
titles = soup.find_all('h2', class_='title')
data = [title.get_text().strip() for title in titles]
print(data)
Recommendation:
Add error handling patterns and retry mechanisms in the workflow section
Workflow (13 / 15)
Progress:
- Identify target website and data requirements
- Choose scraping method (requests vs selenium)
- Inspect HTML structure and identify selectors
- Handle rate limiting and headers
- Extract and clean data
- Store results in structured format
Step-by-step Process:
- Setup and Headers
Python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
session = requests.Session()
session.headers.update(headers)
- Handle Dynamic Content (if needed)
Python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait up to 10 seconds for the target element to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "target-class"))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always release the browser process
- Extract and Structure Data
Python
results = []
for item in soup.find_all('div', class_='item'):
    data = {
        'title': item.find('h3').get_text().strip(),
        'price': item.find('span', class_='price').get_text(),
        'url': item.find('a')['href']
    }
    results.append(data)

df = pd.DataFrame(results)
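The extraction step above raises an exception as soon as one item is missing its heading, price, or link. As a minimal sketch of a more defensive variant, the snippet below parses a small inline HTML sample and tries fallback selectors before giving up. The sample markup, the helper names text_or_none and extract_items, and the fallback selectors .title and .cost are illustrative assumptions, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a fetched page.
SAMPLE_HTML = """
<div class="item">
  <h3> Blue Widget </h3>
  <span class="price">$19.99</span>
  <a href="/products/blue-widget">View</a>
</div>
<div class="item">
  <h3>Red Widget</h3>
  <a href="/products/red-widget">View</a>
</div>
"""


def text_or_none(parent, selectors):
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        node = parent.select_one(selector)
        if node is not None:
            return node.get_text(strip=True)
    return None


def extract_items(html):
    """Turn each div.item into a dict, using None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("div", class_="item"):
        link = item.find("a")
        results.append({
            "title": text_or_none(item, ["h3", ".title"]),
            "price": text_or_none(item, ["span.price", ".cost"]),
            "url": link.get("href") if link is not None else None,
        })
    return results


records = extract_items(SAMPLE_HTML)
for record in records:
    print(record)
```

With this sample, the second record keeps its title and URL while its price comes back as None, so a single incomplete listing no longer aborts the whole run. The resulting list of dicts can be passed to pd.DataFrame in the same way as results above.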
Recommendation:
Include concrete input/output examples showing actual scraped data structure rather than just code snippets
Examples (15 / 20)
Example 1:
Input: Scrape product listings from an e-commerce site
Output:
Python
products = []
for product in soup.find_all('div', class_='product-item'):
    products.append({
        'name': product.find('h4').text.strip(),
        'price': product.find('span', class_='price').text,
        # Rating is the number of filled star icons
        'rating': len(product.find_all('i', class_='star-filled'))
    })
Example 2:
Input: Extract news headlines with timestamps
Output:
Python
articles = []
for article in soup.select('article.news-item'):
    articles.append({
        'headline': article.select_one('h2').text.strip(),
        'timestamp': article.select_one('time')['datetime'],
        'summary': article.select_one('.summary').text.strip()
    })
Recommendation:
Add templates for common scraping patterns (pagination, form submission, authentication) to improve completeness
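As one possible shape for the pagination pattern mentioned in this recommendation, the sketch below follows "next" links until none remain. It is a hedged outline rather than a finished template: crawl_pages, the fetch_page callback, and the FAKE_SITE table are hypothetical names, and the table stands in for real HTTP requests and parsing so the loop can run offline.

```python
from urllib.parse import urljoin


def crawl_pages(start_url, fetch_page, max_pages=50):
    """Follow 'next' links until none remain or max_pages is reached.

    fetch_page(url) is a caller-supplied function (hypothetical) that
    returns (items, next_href); next_href is None on the last page.
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pagination loops
        page_items, next_href = fetch_page(url)
        items.extend(page_items)
        # Resolve relative links such as "?page=2" against the current URL
        url = urljoin(url, next_href) if next_href else None
    return items


# Stand-in for real fetching and parsing: URL -> (items, next link).
FAKE_SITE = {
    "https://example.com/list?page=1": (["a", "b"], "?page=2"),
    "https://example.com/list?page=2": (["c"], "/list?page=3"),
    "https://example.com/list?page=3": (["d"], None),
}

all_items = crawl_pages("https://example.com/list?page=1", FAKE_SITE.__getitem__)
print(all_items)
```

In a real scraper, fetch_page would request the URL, parse the listing, and return the href of the "next" link if one exists. The seen set and max_pages cap are there so a site that links back to an earlier page cannot trap the loop.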
Best Practices
- Always check robots.txt and respect rate limits
- Use sessions for multiple requests to same domain
- Implement exponential backoff for failed requests
- Cache responses when possible to avoid redundant requests
- Use CSS selectors for precise element targeting
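- Handle encoding issues by detecting the charset rather than assuming UTF-8 (with requests, response.apparent_encoding offers a guess when the declared charset is missing or wrong)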
- Store raw HTML for debugging complex parsing issues
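Two of the practices above, checking robots.txt and backing off exponentially after failures, can be sketched with the standard library alone. This is an illustrative outline under stated assumptions: is_allowed and retry_with_backoff are hypothetical helper names, the robots.txt rules and the "my-scraper" user agent are made up, and the fetch callable stands in for something like session.get.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def retry_with_backoff(fetch, max_attempts=4, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(), doubling the wait after each failure, plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Made-up rules for illustration.
RULES = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(RULES, "my-scraper", "https://example.com/products"))
print(is_allowed(RULES, "my-scraper", "https://example.com/private/report"))
```

The sleep parameter is injectable so the retry logic can be exercised without actually waiting. In practice, retry_on would usually be narrowed to network errors and retryable status codes instead of every exception.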
Common Pitfalls
- Don't scrape too aggressively - implement delays between requests
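- Don't ignore HTTP status codes: check the status before parsing, log and skip 404s, and treat a 403 as a sign that access is blocked rather than retrying blindly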
- Don't assume HTML structure is consistent across pages
- Don't forget to close Selenium drivers with driver.quit(), or orphaned browser processes will leak memory
- Don't hardcode selectors without fallback options
- Don't ignore JavaScript-rendered content when present
- Don't scrape without checking if an API exists first
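The first pitfall, scraping too aggressively, is often addressed with a small throttle that enforces a minimum gap between requests. The class below is a minimal sketch of that idea; Throttle, its min_interval parameter, and the 0.2 second interval are illustrative assumptions, and the actual request is left as a comment.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/c"]:
    throttle.wait()
    # A real scraper would issue the request here, e.g. session.get(url, timeout=10)
elapsed = time.monotonic() - start
print(f"3 requests spaced over {elapsed:.2f}s")
```

The first call returns immediately and each later call sleeps only for whatever part of the interval has not already passed, so time spent parsing counts toward the delay. A suitable interval depends on the site; the value here is only for demonstration.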