2013-01-23 64 views
5

Python: How do I extract URLs from an HTML page using BeautifulSoup?

I have an HTML page with multiple divs like

<div class="article-additional-info"> 
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t... 
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"> 
<span class="arrows">»</span> 
</a> 
</div> 

<div class="article-additional-info"> 
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe... 
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"> 
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"> 
</div> 

and I need to get the <a href=> value for every div with the class article-additional-info. I am new to BeautifulSoup,

so I need these URLs:

"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece" 
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece" 

What is the best way to achieve this?

Answers

8

By your criteria, this returns three URLs (not two). Do you want to filter out the third?

The basic idea is to iterate through the HTML, pull out only the elements with your class, and then iterate through all of the links inside each of those elements, pulling out the actual URLs:

In [1]: from bs4 import BeautifulSoup 

In [2]: html = # your HTML 

In [3]: soup = BeautifulSoup(html) 

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print(link.get('href'))
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

This limits the search to elements with the class article-additional-info, then finds every anchor (a) tag inside them and reads its href attribute.
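The loop described above can be packaged as a small function. What follows is only a minimal sketch, not the answerer's exact code: the function name `extract_links`, the `HTML` sample, and the example.com URLs are hypothetical stand-ins for the page in the question, and the parser is named explicitly as `html.parser` by assumption.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (hypothetical URLs).
HTML = """
<div class="article-additional-info">
  First teaser...
  <a class="more" href="http://example.com/article1.ece"><span class="arrows">»</span></a>
</div>
<div class="article-additional-info">
  Second teaser...
  <a class="more" href="http://example.com/article2.ece"></a>
  <a class="commentsCount" href="http://example.com/article2.ece#comments"></a>
</div>
<div class="unrelated"><a href="http://example.com/ignored"></a></div>
"""


def extract_links(html, css_class="article-additional-info"):
    """Return every href found inside <div> elements carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_=css_class):
        # href=True skips anchors that have no href attribute at all.
        for anchor in div.find_all("a", href=True):
            links.append(anchor["href"])
    return links


if __name__ == "__main__":
    for url in extract_links(HTML):
        print(url)
```

Returning a list instead of printing inside the loop makes it easier to filter or deduplicate the URLs afterwards.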

2
from bs4 import BeautifulSoup as BS 
html = # Your HTML 
soup = BS(html) 
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print(links.get('href'))

which prints:

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 
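The third URL in this output is the comments link, which the question did not ask for. One possible way to drop it, assuming the article link always carries the class `more` as in the sample markup, is to select only those anchors. The function name `extract_article_links` and the example.com URLs below are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the structure of the question's HTML.
HTML = """
<div class="article-additional-info">
  <a class="more" href="http://example.com/article1.ece"></a>
</div>
<div class="article-additional-info">
  <a class="more" href="http://example.com/article2.ece"></a>
  <a class="commentsCount" href="http://example.com/article2.ece#comments"></a>
</div>
"""


def extract_article_links(html):
    """Return hrefs of <a class="more"> anchors inside the target divs."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS descendant selector: anchors with class "more" inside the divs.
    anchors = soup.select("div.article-additional-info a.more")
    # .get() returns None for anchors lacking href, so filter those out.
    return [a.get("href") for a in anchors if a.get("href")]


if __name__ == "__main__":
    for url in extract_article_links(HTML):
        print(url)
```

If the class name is not reliable on the real page, filtering out URLs that contain a `#comments` fragment would be an alternative.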
2

After working through the documentation, I did it the following way. Thank you all for your answers, I appreciate them.

>>> import urllib2  # Python 2 only; on Python 3 use urllib.request instead
>>> from bs4 import BeautifulSoup
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...     print(link.find('a').attrs['href'])
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece 
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece 
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece 
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece 
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article.ece 
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece 
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece 
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece 
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece 
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece 
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece 
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece 
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece 
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece 
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece 
>>> 
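Note that `find('a')` returns only the first anchor in each block, which is why the comments links do not appear here, and it returns `None` when a block contains no anchor, in which case `.attrs` would raise an `AttributeError`. A guarded variant might look like the sketch below; the function name `first_link_per_block` and the example.com sample are hypothetical, and the HTML is inlined so the example does not depend on the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the middle block deliberately has no link.
HTML = """
<div class="article-additional-info">
  <a class="more" href="http://example.com/article1.ece"></a>
  <a class="commentsCount" href="http://example.com/article1.ece#comments"></a>
</div>
<div class="article-additional-info">Teaser with no link at all</div>
<div class="article-additional-info">
  <a class="more" href="http://example.com/article2.ece"></a>
</div>
"""


def first_link_per_block(html):
    """Return the first href in each .article-additional-info block."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for block in soup.select(".article-additional-info"):
        anchor = block.find("a", href=True)  # None when the block has no link
        if anchor is not None:
            urls.append(anchor["href"])
    return urls


if __name__ == "__main__":
    for url in first_link_per_block(HTML):
        print(url)
```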
0
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print(link.get('href'))
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 
+1

Please do not link back to your own website; that is [**spam**](http://stackoverflow.com/help/promotion) on Stack Overflow.
