Scrapy HtmlXPathSelector

I'm just trying to get Scrapy set up and get a basic spider working. I know this is probably just something I'm missing, but I've tried everything I can think of.

The error I get is:

line 11, in JustASpider 
    sites = hxs.select('//title/text()') 
NameError: name 'hxs' is not defined

My code is very basic at the moment, but I still can't seem to find where I'm going wrong. Thanks for any help!

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

class JustASpider(BaseSpider): 
    name = "google.com" 
    start_urls = ["http://www.google.com/search?hl=en&q=search"] 


    def parse(self, response): 
     hxs = HtmlXPathSelector(response) 
     sites = hxs.select('//title/text()') 
     for site in sites: 
      print site.extract() 


SPIDER = JustASpider()

2012-09-03 Keanan Koppenhaver

How are you running the spider? 'scrapy crawl google.com'? – Leo

There is nothing wrong with your code (apart from the fact that declaring SPIDER is no longer necessary); it works for me.

@Leo That is how I was running it.

I removed the SPIDER call at the end and removed the for loop. There was only one title tag (as one would expect), and it seems that was throwing off the loop. The code I have working is as follows:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

class JustASpider(BaseSpider): 
    name = "google.com" 
    start_urls = ["http://www.google.com/search?hl=en&q=search"] 


    def parse(self, response): 
     hxs = HtmlXPathSelector(response) 
     titles = hxs.select('//title/text()') 
     final = titles.extract()

2012-09-10 16:27:07

Your code works, but it is better to use a simple name for the spider, such as "google" or "googleSpider", instead of "google.com". – parik

Make sure you are running the code you are showing us.

Try deleting the *.pyc files in your project.
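
A minimal sketch of that cleanup step, assuming the project lives in a single directory tree. The helper name `remove_pyc_files` is made up for illustration and is not part of Scrapy; it only uses the standard library.

```python
import os


def remove_pyc_files(project_dir):
    """Delete every *.pyc file under project_dir and return the removed paths."""
    removed = []
    for root, _dirs, files in os.walk(project_dir):
        for filename in files:
            if filename.endswith(".pyc"):
                path = os.path.join(root, filename)
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Calling `remove_pyc_files(".")` from the project root forces Python to recompile every module on the next run, which rules out stale bytecode as the cause.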

2012-09-05 04:47:16 warvariuc

After deleting all the .pyc files in the directory, I still get the same error. If I were missing a dependency, wouldn't I get an import error?

Check the indentation in your code. Maybe you have mixed tabs with spaces? – warvariuc

I had a similar problem, NameError: name 'hxs' is not defined, and it came down to spaces versus tabs: my IDE was using spaces instead of tabs. You should check for that.
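
The traceback reports the error "in JustASpider" rather than "in parse", which suggests the failing line ran while the class body was being executed, that is, outside the method. The sketch below reproduces that situation under this assumption; the `Demo` class and the helper name are made up for illustration.

```python
def define_misindented_class():
    """Return the NameError message raised when a method line slips out to class level."""
    source = (
        "class Demo(object):\n"
        "    def parse(self, response):\n"
        "        hxs = response\n"
        "    sites = hxs\n"  # dedented: runs while the class body executes
    )
    try:
        exec(source, {})
    except NameError as error:
        return str(error)
    return None


message = define_misindented_class()
print(message)
```

Here `hxs` only exists inside `parse`, so the dedented line cannot see it. Mixed tabs and spaces can produce the same effect even when the code looks correctly aligned in an editor.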
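
One way to check for this is to scan the spider file for tabs in the leading whitespace. This is a rough sketch, not a Scrapy feature; the function name `find_mixed_indentation` and the sample source are assumptions made for illustration.

```python
def find_mixed_indentation(source):
    """Return (line_number, kind) for each line whose indentation contains a tab.

    kind is "tab" when the leading whitespace is tabs only,
    and "mixed" when it contains both tabs and spaces.
    """
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            kind = "mixed" if " " in indent else "tab"
            problems.append((number, kind))
    return problems


sample = "class A(object):\n    def parse(self):\n\tx = 1\n    \ty = x\n"
print(find_mixed_indentation(sample))  # [(3, 'tab'), (4, 'mixed')]
```

To check a real file, pass `open("test.py").read()` instead of `sample`. The standard library's `tabnanny` module (`python -m tabnanny test.py`) performs a similar check.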

2013-01-23 23:22:51

This worked for me:

Save the file as test.py
Run it with the command scrapy runspider <filename.py>

For example:

scrapy runspider test.py

2013-08-19 15:01:00

The code looks correct.

In the latest versions of Scrapy, HtmlXPathSelector is deprecated. Use Selector instead:

hxs = Selector(response) 
sites = hxs.xpath('//title/text()')

2014-02-14 05:14:58 dimka665

This is just a demo, but it works. It needs to be customized, of course.

#!/usr/bin/env python

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "DMOZ"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

2014-06-21 19:20:27 user3672836

You should change

from scrapy.selector import HtmlXPathSelector

to

from scrapy.selector import Selector

and use hxs = Selector(response) instead.

2015-04-26 05:38:32 neal

The code looks like it targets a fairly old version. I recommend using the following code instead:

from scrapy.spider import Spider
from scrapy.selector import Selector

class JustASpider(Spider):
    name = "googlespider"
    allowed_domains = ["google.com"]
    start_urls = ["http://www.google.com/search?hl=en&q=search"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//title/text()').extract()
        print sites
        # for site in sites: (I don't know why you want to loop to extract the text of the title element)
        #     print site.extract()

Hope it helps, and here is a good example to follow.

2015-09-04 06:28:46

I use Scrapy with BeautifulSoup 4. For me, Soup is easy to read and understand. It is an option if you don't have to use HtmlXPathSelector. Hope this helps!

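import scrapy
from bs4 import BeautifulSoup

# Assumption: the original imported a project-specific Item class;
# a minimal stand-in with a single 'url' field is defined here.
class Item(scrapy.Item):
    url = scrapy.Field()

def parse(self, response):
    # Parse the response body with BeautifulSoup instead of an XPath selector.
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            # Create a fresh item per link so earlier items are not overwritten.
            item = Item()
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item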

2016-10-11 19:13:57 sarc360
