An alternative to the regex approach is to use a JavaScript parser, convert the parser's output to an XML document, and query that document with XPath.
That's what js2xml does, using slimit and lxml
(disclaimer: I wrote js2xml; caveat: it's not stable).
In your case, check this sample scrapy shell session, which uses js2xml.jsonlike.getall():
paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines:
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f8552946610>
[s] item {}
[s] request <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s] response <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
warn("The top-level `frontend` package has been deprecated. "
In [1]: scripts = response.selector.xpath('//script/text()').extract()
In [2]: import js2xml, js2xml.jsonlike
In [3]: js = js2xml.parse(scripts[-1])
In [4]: js2xml.jsonlike.getall(js)
Out[4]:
[{'onVariantSelected': 'selectCallback',
'product': {'available': True,
'compare_at_price': None,
'compare_at_price_max': 0,
'compare_at_price_min': 0,
'compare_at_price_varies': False,
'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
'created_at': '2013-11-29T13:37:11+02:00',
'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
'handle': '2loom-design-siyah-beyaz-kalpli',
'id': 185310341,
'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
'//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
'//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
'//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
'options': ['Size'],
'price': 15900,
'price_max': 15900,
'price_min': 15900,
'price_varies': False,
'published_at': '2013-11-29T13:34:20+02:00',
'tags': [u'2\xb7Loom',
'Beyaz',
'Design',
'Ekrek',
u'Kad\u0131n',
'Kalpli',
'Lacivert'],
'title': '10. Design | Siyah & beyaz kalpli',
'type': '2 Loom Limiteds',
'variants': [{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584985,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 3,
'option1': 'XS (34-36: 1.60m-1.70m)',
'option2': None,
'option3': None,
'options': ['XS (34-36: 1.60m-1.70m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-XS',
'taxable': True,
'title': 'XS (34-36: 1.60m-1.70m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584989,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 3,
'option1': 'S (36-38: 1.65m-1.75m)',
'option2': None,
'option3': None,
'options': ['S (36-38: 1.65m-1.75m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-S',
'taxable': True,
'title': 'S (36-38: 1.65m-1.75m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584997,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 7,
'option1': 'M (38-40: 1.70m-1.80m)',
'option2': None,
'option3': None,
'options': ['M (38-40: 1.70m-1.80m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-M',
'taxable': True,
'title': 'M (38-40: 1.70m-1.80m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424585001,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 7,
'option1': 'L (40-42: 1.75m-1.85m)',
'option2': None,
'option3': None,
'options': ['L (40-42: 1.75m-1.85m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-L',
'taxable': True,
'title': 'L (40-42: 1.75m-1.85m)',
'weight': 0}],
'vendor': u'2\xb7Loom'}}]
In [5]:
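Once you have that list, getting the id is plain dict access, with no regex needed. Here is a minimal sketch, assuming the result keeps the shape shown in Out[4]. The `data` literal is a trimmed copy of that output, and `find_product` is a hypothetical helper written for this example, not part of js2xml:

```python
# Trimmed copy of the structure shown in Out[4]; only a few keys are kept.
data = [{
    'onVariantSelected': 'selectCallback',
    'product': {
        'id': 185310341,
        'handle': '2loom-design-siyah-beyaz-kalpli',
        'price': 15900,
        'variants': [
            {'id': 424584985, 'sku': 'T01-BLWH-1-XS', 'inventory_quantity': 3},
            {'id': 424584989, 'sku': 'T01-BLWH-1-S', 'inventory_quantity': 3},
            {'id': 424584997, 'sku': 'T01-BLWH-1-M', 'inventory_quantity': 7},
            {'id': 424585001, 'sku': 'T01-BLWH-1-L', 'inventory_quantity': 7},
        ],
    },
}]


def find_product(objects):
    """Return the first 'product' dict in a list of parsed objects, or None."""
    for obj in objects:
        if isinstance(obj, dict) and isinstance(obj.get('product'), dict):
            return obj['product']
    return None


product = find_product(data)
product_id = product['id']
# Map each variant id to its SKU.
variant_skus = {v['id']: v['sku'] for v in product['variants']}

print(product_id)
print(variant_skus)
```

In a spider you would pass the real output of js2xml.jsonlike.getall(js) instead of the hard-coded `data`, and handle the case where `find_product` returns None (for example, if the page's last script changes).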
My script is in the body.
I will use `id = sel.xpath('//body').re(r'"id": (\d+)')`. Is that correct? –
I got that script from the body elements, and I used `id = re.search('"id": (\d+)', sel.xpath("//body/text()").extract()).group(1)`, but it raises an error. –
@MuhammetArslan as I noted in the answer, use `sel.xpath('//body/text()').re(r'"id": (\d+)')`. – alecxe