AI Skill Report Card
Web Scraping
Quick Start (15 / 15)
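```python
import requests
from bs4 import BeautifulSoup

# Fetch and parse a web page
url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.content, 'html.parser')

# Extract specific elements
title_tag = soup.find('title')
title = title_tag.text if title_tag else None  # the page may have no <title>
links = [a['href'] for a in soup.find_all('a', href=True)]
```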
Recommendation: Add concrete input/output examples showing actual HTML snippets and extracted results rather than just code patterns.
Workflow (12 / 15)
- Send HTTP Request - Fetch the web page content
- Parse HTML - Use BeautifulSoup to create a navigable tree
- Extract Data - Target specific elements using selectors
- Clean Data - Remove unwanted characters and normalize text
- Store Results - Save to file or database
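The parse, extract, clean, and store steps above can be sketched end to end. This is a minimal illustration, not a definitive implementation: `extract_links`, `to_csv`, and `SAMPLE_HTML` are hypothetical names invented here, and the sample HTML stands in for a fetched response body so the sketch runs without network access.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_links(html):
    """Parse HTML and return (text, href) pairs for every anchor with an href."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.find_all("a", href=True):
        # Clean step: collapse runs of whitespace in the link text.
        text = " ".join(a.get_text().split())
        rows.append((text, a["href"]))
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text; writing to an open file works the same way."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "href"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample page standing in for response.content from the fetch step.
SAMPLE_HTML = """
<html><body>
  <a href="/docs">  Read   the docs </a>
  <a>No href here</a>
  <a href="https://example.com/blog">Blog</a>
</body></html>
"""

rows = extract_links(SAMPLE_HTML)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping the fetch step separate from parsing makes the extraction logic easy to test against saved HTML.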
Progress:
- Identify target elements
- Write CSS selectors
- Test extraction logic
- Handle edge cases
- Validate output
Recommendation: Include error handling templates and common HTTP status code scenarios in the workflow.
Examples (15 / 20)
Example 1: Extract All Links
Input: HTML page with multiple anchor tags
Output:
```python
links = soup.find_all('a')
urls = [link.get('href') for link in links if link.get('href')]
```
Example 2: Extract Text Content
Input: <div class="content">Hello World</div>
Output:
```python
content = soup.find('div', class_='content').text.strip()
# Result: "Hello World"
```
Example 3: Extract Table Data
Input: HTML table
Output:
```python
table = soup.find('table')
rows = [
    [cell.text.strip() for cell in row.find_all(['td', 'th'])]
    for row in table.find_all('tr')
]
```
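To make the table example concrete, here is a small worked case with an actual HTML snippet. The table contents are invented for illustration, and turning the first row into dictionary keys is one possible convention, not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical input table, invented for illustration.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Widget </td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# One list of cell strings per table row, with surrounding whitespace stripped.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    for row in table.find_all("tr")
]

# Treat the first row as the header and turn the rest into dictionaries.
header, *body = rows
records = [dict(zip(header, values)) for values in body]

print(rows)     # [['Name', 'Price'], ['Widget', '9.99'], ['Gadget', '24.50']]
print(records)  # [{'Name': 'Widget', 'Price': '9.99'}, {'Name': 'Gadget', 'Price': '24.50'}]
```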
Recommendation: Provide a complete working example that demonstrates the full pipeline from URL to cleaned data output.
Best Practices
- Respect robots.txt - Check the site's crawling policies
- Add delays - Use time.sleep() between requests
- Handle errors - Wrap requests in try/except blocks
- Use headers - Set User-Agent to avoid blocking
- Parse incrementally - Process large pages in chunks
- Cache responses - Store HTML locally to avoid re-fetching
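Several of the practices above (robots.txt checks, delays, headers, error handling) can be combined in one sketch. It assumes a simple linear back-off between retries. `USER_AGENT`, `allowed_by_robots`, `polite_get`, and the sample `ROBOTS_TXT` are hypothetical names and values chosen for illustration, and the robots rules are parsed from a string so the check runs without network access.

```python
import time
from urllib.robotparser import RobotFileParser

import requests

# Assumed identity string; replace with your own project name and contact URL.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_get(url, session=None, retries=3, delay=1.0, timeout=10):
    """GET a URL with a User-Agent header, a timeout, and simple retries."""
    session = session or requests.Session()
    for attempt in range(1, retries + 1):
        try:
            response = session.get(
                url, headers={"User-Agent": USER_AGENT}, timeout=timeout
            )
            response.raise_for_status()  # raise on 4xx/5xx status codes
            return response
        except requests.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(delay * attempt)  # wait a little longer after each failure


# Hypothetical robots.txt content, invented for illustration.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
"""

can_fetch_public = allowed_by_robots(ROBOTS_TXT, "https://example.com/articles/1")
can_fetch_private = allowed_by_robots(ROBOTS_TXT, "https://example.com/private/data")
print(can_fetch_public, can_fetch_private)  # True False
```

In real use, the robots.txt text would itself be fetched once per site, and `polite_get` would be called only for URLs that pass the check.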
Common Pitfalls
- Don't scrape without checking Terms of Service
- Don't ignore rate limiting - sites may block aggressive scrapers
- Don't assume elements exist - always check that .find() returned a match (not None) before accessing it
- Don't ignore encoding issues - specify encoding when parsing
- Don't scrape dynamic content without JavaScript rendering (use Selenium instead)
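Two of the pitfalls above, missing elements and encoding issues, can be guarded against as in the following sketch. The helper `text_or_default` is a hypothetical name introduced here, and the sample bytes assume a server that sends Latin-1 content.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=""):
    """Return stripped text of the first match for a CSS selector, or a default."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


# Bytes as they might arrive from a server that uses a legacy encoding.
raw = "<html><body><h1>Café menu</h1></body></html>".encode("latin-1")

# from_encoding tells BeautifulSoup how to decode the bytes instead of guessing.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

heading = text_or_default(soup, "h1")
price = text_or_default(soup, "span.price", default="n/a")  # element is absent

print(heading)  # Café menu
print(price)    # n/a
```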