Summary of today's content

  • selenium usage
  • Using a captcha-solving platform
  • xpath usage
  • Scraping JD.com product information
  • scrapy introduction and installation

Detailed content

1、Using the selenium module

#  Earlier we learned requests, which can send http requests. But some pages are built from render+ajax: with requests alone you only execute the render request and get that data back; to perform the ajax requests, you have to analyze them and then send them yourself

#  selenium controls and operates the browser, reproducing human behavior --> an automated testing tool

#  Essentially, python code drives the browser through a browser driver and really operates it: whatever you can see, you can scrape

#  Install the module:
pip3 install selenium
# Download the driver for your browser: IE, Firefox, Chrome (recommended)
# Chrome drivers:
https://registry.npmmirror.com/binary.html?path=chromedriver/
# The driver must match the browser version:
# browser 100.0.4896.127 ---> the driver should correspond, e.g. https://registry.npmmirror.com/binary.html?path=chromedriver/101.0.4951.41/
# If there is no exact version, pick the closest one
# Put the driver in the project directory

1.0 Basic use

from selenium import webdriver
import time

# Open a browser from code
# bro = webdriver.Chrome(executable_path='./chromedriver')  # mac/linux
bro = webdriver.Chrome(executable_path='chromedriver.exe')  # win

# Enter the address in the address bar
bro.get('https://www.baidu.com')

# Find the input box
search = bro.find_element_by_id('kw')
# Type a search term into the input box
search.send_keys("beauty")
# Find the search button
button = bro.find_element_by_id('su')
# Click the button
button.click()
time.sleep(2)

print(bro.page_source)  # html of the current page, including render+ajax content
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(bro.page_source)

bro.close()

1.1 Headless browser

#  When scraping you don't want the browser window to show, but selenium has to drive a browser; so run the browser hidden, in the background, and finish the scrape there
from selenium import webdriver

from selenium.webdriver.chrome.options import Options

#  Get a configuration object 
chrome_options = Options()
chrome_options.add_argument('window-size=1920x3000')  # set the browser resolution
chrome_options.add_argument('--disable-gpu')  # Google's docs mention adding this to work around a bug
chrome_options.add_argument('--hide-scrollbars')  # hide scrollbars, for special pages
chrome_options.add_argument('blink-settings=imagesEnabled=false')  # don't load images, for speed
chrome_options.add_argument('--headless')  # no visible page; on linux, if the system has no display, startup fails without this

bro = webdriver.Chrome(executable_path='./chromedriver', options=chrome_options)
bro.get('http://www.cnblogs.com')
print(bro.page_source)
bro.close()

1.2 Getting an element's location, attributes, and size

#  The general approach to cracking captchas
#  Note: a tag's position and size: size and location
#  Generally used to crop captcha images out of a screenshot: resolution differences can distort the crop ---> adjust the resolution ---> get an accurate crop
#  If the captcha is an <img> with a src, you can load it yourself and save it locally with requests ---> that's simpler (see the sketch after this code)

# For any tag object you have found:
# print(tag.id)        # an id, but not the tag's html id: an id selenium assigns internally
# print(tag.location)  # location
# print(tag.tag_name)  # tag name
# print(tag.size)      # size of the tag

from selenium import webdriver
import time
from PIL import Image

bro = webdriver.Chrome(executable_path='./chromedriver')  # mac/linux
# Enter the address in the address bar
bro.get('https://www.jd.com/')
bro.implicitly_wait(10)

# Find the logo image
img = bro.find_element_by_css_selector('a.logo_tit_lk')
# print(img.location)  # image position, e.g. {'x': 105, 'y': 41}
# print(img.size)      # image size; position + size uniquely determine the image, so it can be cut out of a screenshot
# print(img.id)        # the id selenium assigns; ignore it
# print(img.tag_name)  # a

location = img.location
size = img.size
bro.save_screenshot('./main.png')  # save the whole page as an image

# Cut the icon out with pillow
# 1st value: x coordinate where the crop starts
# 2nd value: y coordinate where the crop starts
# 3rd value: x coordinate where the crop ends
# 4th value: y coordinate where the crop ends
img_tu = (
    int(location['x']), int(location['y']),
    int(location['x'] + size['width']), int(location['y'] + size['height']))
# Open the screenshot with pillow
img = Image.open('./main.png')
# Crop the target region out of the screenshot by position
code_img = img.crop(img_tu)
# Save the cropped image locally
code_img.save('./code.png')
bro.close()
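As the note above says, when the captcha is a plain <img> with a src, downloading it directly is simpler than cropping a screenshot. A minimal sketch, assuming a hypothetical login page and captcha selector (both placeholders); note that requests opens a fresh session, so for session-bound captchas you would also need to pass the browser's cookies along:

import requests
from selenium import webdriver

bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get('https://www.example.com/login')  # placeholder URL
bro.implicitly_wait(10)

# Hypothetical selector: an <img> tag holding the captcha
img_tag = bro.find_element_by_css_selector('img.captcha')
img_src = img_tag.get_attribute('src')

# Download the image bytes directly and save them locally
res = requests.get(img_src)
with open('./code.png', 'wb') as f:
    f.write(res.content)
bro.close()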

1.3 Waiting for elements to load

#  Code runs very fast: a tag may not have loaded yet when the code tries to grab and operate on it, so the lookup fails and raises an error

#  Wait until the tag has finished loading
# Explicit wait: you write wait logic for every single tag
# Implicit wait: every tag lookup follows the same logic; write it once (recommended)
bro.implicitly_wait(10)  # when fetching a tag, if it can't be found, keep waiting until it loads or 10s pass
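For contrast, this is what an explicit wait looks like with selenium's WebDriverWait; a minimal sketch against Baidu's search box (id 'kw', as in 1.0):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get('https://www.baidu.com')
# Block until the element with id 'kw' is present, or raise TimeoutException after 10s
search = WebDriverWait(bro, 10).until(EC.presence_of_element_located((By.ID, 'kw')))
print(search.tag_name)
bro.close()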

1.4 Element operations

from selenium import webdriver
import time

bro = webdriver.Chrome(executable_path='./chromedriver')  # mac/linux
bro.get('https://www.baidu.com/')
bro.implicitly_wait(10)  # implicit wait

# Ways to find tags in selenium: find_element_by_xx / find_elements_by_xx
# 1、find_element_by_id                 # by id
# 2、find_element_by_link_text          # by the text of an <a> tag
# 3、find_element_by_partial_link_text  # by a partial match on <a> tag text
# 4、find_element_by_tag_name           # by tag name
# 5、find_element_by_class_name         # by class name
# 6、find_element_by_name               # by the name attribute
# 7、find_element_by_css_selector       # css selector
# 8、find_element_by_xpath              # xpath selector

# Find the <a> tag whose text is 登录 (baidu's login link)
login_a = bro.find_element_by_link_text('登录')
# login_a = bro.find_element_by_id('s-top-loginbtn')
# Click the <a> tag
login_a.click()

# Find the "log in with username and password" button, and click it
login_pwd_btn = bro.find_element_by_id('TANGRAM__PSP_11__changePwdCodeItem')
login_pwd_btn.click()

# Find the username and password inputs ---> type in the username and password
username = bro.find_element_by_name('userName')
pwd = bro.find_element_by_css_selector('#TANGRAM__PSP_11__password')
username.send_keys("[email protected]")
pwd.send_keys('lqz12345678')
time.sleep(3)
username.clear()
username.send_keys("[email protected]")
submit = bro.find_element_by_id('TANGRAM__PSP_11__submit')
submit.click()
# A captcha pops up ---> you can click through it manually
# Login succeeded
time.sleep(5)
bro.close()

# Fully automated login keeps getting harder ---> so what is logging in for?
# ---> get a cookie, then send requests with it: auto-reply, vote, like, comment
# ---> after semi-automatic login ---> collect cookies and build a cookie pool
# ---> every later requests call that comments or votes carries a cookie from the pool

# Element methods so far: send_keys, click, clear, plus the find methods above
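One caveat worth knowing: in Selenium 4 the find_element_by_* helpers listed above are deprecated in favor of a single find_element(By..., value) method. A minimal equivalent sketch, in case you are on a newer version:

from selenium import webdriver
from selenium.webdriver.common.by import By

bro = webdriver.Chrome()  # Selenium 4.6+ can locate the driver by itself
bro.get('https://www.baidu.com/')
bro.implicitly_wait(10)
kw = bro.find_element(By.ID, 'kw')            # was: find_element_by_id('kw')
links = bro.find_elements(By.TAG_NAME, 'a')   # was: find_elements_by_tag_name('a')
print(len(links))
bro.close()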

1.5 Executing js

#  Commonly used: execute js code directly in the current page
# Case one: control scrolling of the page
# Case two: use variables defined in the current page, call functions defined in the page
from selenium import webdriver
import time

bro = webdriver.Chrome(executable_path='./chromedriver')  # mac/linux
bro.get('https://www.pearvideo.com/category_9')
bro.implicitly_wait(10)  # implicit wait

# bro.execute_script("alert('hello')")

# Case one: control scrolling of the page
# bro.execute_script('window.scrollBy(0, document.body.scrollHeight)')
# time.sleep(1)
# bro.execute_script('window.scrollBy(0, document.body.scrollHeight)')
# time.sleep(1)
# bro.execute_script('window.scrollBy(0, document.body.scrollHeight)')

# Case two: use variables in the current page, call functions defined in the page
# bro.execute_script('alert(md5_vm_test())')
# bro.execute_script('alert(urlMap)')

time.sleep(5)  # keep the browser open for 5 seconds, then close it
bro.close()

1.6 Switching tabs

import time
from selenium import webdriver

browser = webdriver.Chrome(executable_path='./chromedriver')
browser.get('https://www.baidu.com')

# Open a new tab
browser.execute_script('window.open()')
print(browser.window_handles)  # get the handles of all tabs
browser.switch_to.window(browser.window_handles[1])
browser.get('https://www.taobao.com')
time.sleep(2)
browser.switch_to.window(browser.window_handles[0])
browser.get('https://www.sina.com.cn')
browser.close()  # close the current tab
browser.quit()   # exit the browser

1.7 Simulating back and forward

import time
from selenium import webdriver

browser = webdriver.Chrome(executable_path='./chromedriver')
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('http://www.sina.com.cn/')

browser.back()
time.sleep(2)
browser.forward()
browser.close()

1.8 Exception handling

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException, NoSuchFrameException

try:
    browser = webdriver.Chrome()
    browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    browser.switch_to.frame('iframssseResult')  # frame name misspelled, so NoSuchFrameException is raised
except TimeoutException as e:
    print(e)
except NoSuchFrameException as e:
    print(e)
finally:
    browser.close()

1.9 Logging in to cnblogs with selenium to get cookies

#  First use selenium to log in to cnblogs semi-automatically ---> take the cookies out and save them locally

#  Next time, visit cnblogs with selenium ---> load the cookies first ---> you are in a logged-in state
from selenium import webdriver
import json
import time

bro = webdriver.Chrome(executable_path='./chromedriver')

# Step 1: log in and collect the cookies
# try:
#     bro.get('http://www.cnblogs.com')
#     bro.implicitly_wait(10)
#     submit_a = bro.find_element_by_link_text('登录')
#     submit_a.click()
#     username = bro.find_element_by_id('mat-input-0')
#     password = bro.find_element_by_id('mat-input-1')
#     username.send_keys('[email protected]')
#     password.send_keys('lqz123')  # or type it in manually
#
#     # submit = bro.find_element_by_class_name('mat-button-wrapper')
#     # submit.click()
#     input()  # type the password manually, click login, pass the captcha, then hit enter here to continue
#     # The captcha that pops up ---> is hard to crack ---> handle it manually
#     # Login succeeded
#     # Save the cookies locally
#     # print(bro.get_cookies())
#     with open('cnblogs.json', 'w', encoding='utf-8') as f:
#         json.dump(bro.get_cookies(), f)
# except Exception as e:
#     print(e)
# finally:
#     bro.close()

# Step 2: visit with the saved cookies
try:
    bro.get('http://www.cnblogs.com')
    bro.implicitly_wait(10)
    # Load the local cookies
    with open('cnblogs.json', 'r', encoding='utf-8') as f:
        cookie_dic = json.load(f)
    # Write them into the browser
    # bro.add_cookie(cookie_dic)
    for item in cookie_dic:  # add_cookie takes a dict; the json file holds a list, so loop over it
        bro.add_cookie(item)
    bro.refresh()  # refresh the browser; you should now be logged in
    time.sleep(2)
except Exception as e:
    print(e)
finally:
    bro.close()

1.10 Semi-automatic likes on Chouti (dig.chouti.com)

# (Fully automated login is hard to pull off.) Use selenium to log in semi-automatically ---> you can log in many throwaway accounts ---> save the cookies to redis (here: to a local file; see the redis sketch below)

#  Then reuse requests plus some cookie from the pool ---> post comments and likes in bulk
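The notes above mention saving cookies to redis to build a pool; the code below uses a local json file instead. For reference, a minimal sketch of what a redis-backed pool could look like (the key name 'cookie_pool' and the hash layout are my own assumptions):

import json
import random
import redis

conn = redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)

def save_cookie(account, selenium_cookies):
    # selenium_cookies: the list returned by bro.get_cookies()
    conn.hset('cookie_pool', account, json.dumps(selenium_cookies))

def random_cookie():
    # Pick a random account and convert its cookies to the
    # {name: value} dict that requests expects
    account = random.choice(conn.hkeys('cookie_pool'))
    items = json.loads(conn.hget('cookie_pool', account))
    return {item['name']: item['value'] for item in items}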
from selenium import webdriver
import time
import json

# Step 1: log in first, get the cookies, save them locally
# bro = webdriver.Chrome(executable_path='./chromedriver')
# try:
#     bro.get('https://dig.chouti.com/')
#     submit_btn = bro.find_element_by_id('login_btn')
#     submit_btn.click()
#     # If that click errors out, use the line below instead
#     # bro.execute_script('arguments[0].click();', submit_btn)  # click via js
#
#     username = bro.find_element_by_name('phone')
#     pwd = bro.find_element_by_name('password')
#     username.send_keys('18953675221')
#     pwd.send_keys('lqz123----')
#
#     submit = bro.find_element_by_css_selector(
#         'body > div.login-dialog.dialog.animated2.scaleIn > div > div.login-footer > div:nth-child(4) > button')
#
#     time.sleep(2)
#     submit.click()
#     # A captcha appears; solve it manually, then hit enter here to continue
#     input()
#
#     with open('chouti.json', 'w', encoding='utf-8') as f:
#         json.dump(bro.get_cookies(), f)
#
#     time.sleep(3)
# except Exception as e:
#     print(e)
# finally:
#     bro.close()

# Step 2: like posts automatically with requests ---> requests can be multithreaded, fast in batches;
# driving a browser with selenium can't be multithreaded and eats a lot of memory
import requests
from bs4 import BeautifulSoup

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
}
res = requests.get('https://dig.chouti.com/', headers=header)
# print(res.text)
soup = BeautifulSoup(res.text, 'lxml')
div_list = soup.find_all(class_='link-item')
for div in div_list:
    article_id = div.attrs.get('data-id')
    print(article_id)
    if article_id:
        data = {
            'linkId': article_id
        }
        # Load the cookies
        cookie = {}
        with open('chouti.json', 'r') as f:
            res = json.load(f)
        for item in res:
            # selenium's cookies and the cookies the requests module uses are not quite the same;
            # requests only needs name and value
            cookie[item['name']] = item['value']
        res = requests.post('https://dig.chouti.com/link/vote', headers=header, data=data, cookies=cookie)
        print(res.text)

# data = {
#     'linkId': '34976644'
# }
# res = requests.post('https://dig.chouti.com/link/vote', headers=header, data=data)
# print(res)
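The comment above notes that requests, unlike a selenium-driven browser, can be multithreaded. A minimal sketch of batching the vote requests with a thread pool; it assumes header and cookie from the code above, and article_ids as a list of the data-id values collected with BeautifulSoup:

from concurrent.futures import ThreadPoolExecutor
import requests

def vote(article_id):
    # Send one vote request with a pooled cookie
    data = {'linkId': article_id}
    res = requests.post('https://dig.chouti.com/link/vote', headers=header, data=data, cookies=cookie)
    return res.text

with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(vote, article_ids):
        print(result)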

2、Using a captcha-solving platform

#  Third-party platforms that crack captchas for you ---> a paid service: hand over the image, someone solves it and sends the answer back
#  e.g. Yunma (云码), Chaojiying (超级鹰)

# Chaojiying:
#   Developer docs: python sample code
#   Use cases: scraping and data collection, automation; both AI and human modes (an auntie cracks it ---> passes the answer to you)
#   Pricing: 1 yuan = 1000 points
import requests
from hashlib import md5


class ChaojiyingClient():
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def PostPic_base64(self, base64_str, codetype):
        """
        base64_str: base64-encoded image
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
            'file_base64': base64_str
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the ID of an incorrectly solved image
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


if __name__ == '__main__':
    chaojiying = ChaojiyingClient('306334678', 'lqz12345', '903641')  # user center >> software ID: generate one to replace 96001
    im = open('./b.png', 'rb').read()  # local image path, replaces a.jpg; on WIN you sometimes need //
    print(chaojiying.PostPic(im, 6001))  # second arg is the captcha type code, see the official price page; on python 3.4+ print needs parentheses
    # print(chaojiying.PostPic(base64_str, 1902))  # pass in a base64 string here

3、Using xpath

# Both css selectors and xpath work; xpath is covered here
# xpath: XML Path Language, a language for selecting parts of an XML document

.          # select the current node
..         # select the parent of the current node
/          # the direct-child path, one level down
//         # any path: children, grandchildren, all descendants
nodename   # a node name: a, img, p, ...
*          # any tag
@href      # take that attribute of the tag
/text()    # take the text of the tag

## Examples
//div      # find div under any path in the current html
/div       # only div at this level
doc = '''
<html>
<head>
<base href='http://example.com/' />
<title>Example website</title>
</head>
<body>
<div id='images'>
<a href='image1.html' id='id_a'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
<a href='image5.html' class='li li-item' name='items'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
<a href='image6.html' name='items'><span><h5>test</h5></span>Name: My image 6 <br /><img src='image6_thumb.jpg' /></a>
</div>
</body>
</html>
'''

# pip3 install lxml
from lxml import etree

html = etree.HTML(doc)
# html = etree.parse('search.html', etree.HTMLParser())

### Examples
# 1 all nodes
a = html.xpath('//*')

# 2 a specific node (the result is a list)
a = html.xpath('//head')

# 3 child nodes, descendant nodes
# a = html.xpath('//div/a')
# a = html.xpath('//body/a')  # no data: a is not a direct child of body
a = html.xpath('//body//a')

# 4 parent node
# a = html.xpath('//body//a[@href="image1.html"]/..')
# a = html.xpath('//body//a[1]/..')
# this works too:
# a = html.xpath('//body//a[1]/parent::*')
a = html.xpath('//body//a[1]/parent::body')  # nothing: the parent is div, not body

# 5 attribute matching
a = html.xpath('//body//a[@href="image1.html"]')

# 6 getting text
a = html.xpath('//body//a[@href="image1.html"]/text()')

# 7 getting attributes
# a = html.xpath('//body//a/@href')
# note: indexing starts at 1 (not at 0)
a = html.xpath('//body//a[1]/@href')

# 8 multi-valued attribute matching
# the a tag's class attribute holds several classes; an exact match fails, use contains
# a = html.xpath('//body//a[@class="li"]')  # not found, because the a tag has several classes
# a = html.xpath('//body//a[contains(@class,"li")]')
a = html.xpath('//body//a[contains(@class,"li")]/text()')

# 9 matching on several attributes
# a = html.xpath('//body//a[contains(@class,"li") or @name="items"]')
# a = html.xpath('//body//a[contains(@class,"li") and @name="items"]/text()')
a = html.xpath('//body//a[contains(@class,"li")]/text()')

# 10 selecting by position
# a = html.xpath('//a[2]/text()')
# a = html.xpath('//a[2]/@href')
# the last one
# a = html.xpath('//a[last()]/@href')
# position less than 3
# a = html.xpath('//a[position()<3]/@href')
# counting from the end (last()-2 is the third from last)
a = html.xpath('//a[last()-2]/@href')

# 11 node-axis selection
# ancestor: ancestor nodes
# a = html.xpath('//a/ancestor::*')    # all ancestor nodes, via *
# a = html.xpath('//a/ancestor::div')  # the div among the ancestors
# attribute: attribute values
# a = html.xpath('//a[1]/attribute::id')
# child: direct child nodes
# a = html.xpath('//a[1]/child::*')
# descendant: all descendant nodes
a = html.xpath('//a[6]/descendant::*')
# following: all nodes after the current node
# a = html.xpath('//a[1]/following::*')
# a = html.xpath('//a[1]/following::*[1]/@href')
# following-sibling: sibling nodes after the current node
# a = html.xpath('//a[1]/following-sibling::*')
# a = html.xpath('//a[1]/following-sibling::a')
# a = html.xpath('//a[1]/following-sibling::*[2]')
a = html.xpath('//a[1]/following-sibling::*[2]/@href')

# The ultimate trick ---> copy the xpath straight from the browser's devtools, e.g.:
# //*[@id="maincontent"]/div[5]/table/tbody/tr[2]/td[2]

print(a)
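The same xpath syntax plugs straight into selenium (method 8 from section 1.4). A minimal sketch, reusing Baidu's search box from earlier:

from selenium import webdriver

bro = webdriver.Chrome(executable_path='./chromedriver')
bro.get('https://www.baidu.com')
bro.implicitly_wait(10)
# The input whose id attribute is kw, located by xpath instead of by id
search = bro.find_element_by_xpath('//input[@id="kw"]')
search.send_keys('selenium')
bro.close()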

4、Scraping JD.com product information with selenium

from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key operations
import time


def get_goods(driver):
    try:
        # Find every tag whose class is gl-item
        goods = driver.find_elements_by_class_name('gl-item')
        for good in goods:
            detail_url = good.find_element_by_tag_name('a').get_attribute('href')
            p_name = good.find_element_by_css_selector('.p-name em').text.replace('\n', '')
            price = good.find_element_by_css_selector('.p-price i').text
            p_commit = good.find_element_by_css_selector('.p-commit a').text
            img = good.find_element_by_css_selector('div.p-img img').get_attribute('src')
            if not img:
                # lazily loaded images keep the real url in data-lazy-img
                img = 'http:' + good.find_element_by_css_selector('div.p-img img').get_attribute('data-lazy-img')
            msg = '''
            goods:    %s
            link:     %s
            picture:  %s
            price:    %s
            comments: %s
            ''' % (p_name, detail_url, img, price, p_commit)
            print(msg, end='\n\n')

        # Find the 下一页 (next page) link, click it, and recurse
        button = driver.find_element_by_partial_link_text('下一页')
        button.click()
        time.sleep(1)
        get_goods(driver)
    except Exception:
        pass


def spider(url, keyword):
    driver = webdriver.Chrome(executable_path='./chromedriver')
    driver.get(url)
    driver.implicitly_wait(3)  # implicit wait
    try:
        input_tag = driver.find_element_by_id('key')
        input_tag.send_keys(keyword)
        input_tag.send_keys(Keys.ENTER)  # hit enter
        get_goods(driver)
    finally:
        driver.close()


if __name__ == '__main__':
    spider('https://www.jd.com/', keyword='精品内衣')  # any search keyword works

5、The scrapy framework: introduction and installation

#  requests, bs4 and selenium, which we learned earlier, are called modules

# scrapy:
#   a framework, similar to the django framework: you write fixed code in fixed places
#   you build your crawler project on top of this framework

#  Installation:
pip3 install scrapy
#  On mac/linux this just works
#  On win it may fail (90% of the time it installs) ---> because twisted fails to build
#  If it won't install on win, follow these steps:
1、pip3 install wheel  # with this, .whl files can be installed directly from now on
   # wheel file site: https://www.lfd.uci.edu/~gohlke/pythonlibs
2、pip3 install lxml
3、pip3 install pyopenssl
4、download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5、download the twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
6、run: pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7、pip3 install scrapy

#  Once installed, there is a scrapy executable, the equivalent of django-admin

#  Create a scrapy project (the equivalent of django-admin startproject), in a cmd window:
scrapy startproject myfirst
#  Open it with pycharm

#  Create a crawler (the equivalent of creating an app in django):
scrapy genspider <spider name> <start domain>
scrapy genspider cnblogs www.cnblogs.com

#  Run the crawler:
scrapy crawl cnblogs
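For orientation, scrapy genspider cnblogs www.cnblogs.com generates a spider file roughly like this sketch (the exact boilerplate varies by scrapy version, and the xpath inside parse is my own placeholder):

import scrapy


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'  # used by: scrapy crawl cnblogs
    allowed_domains = ['www.cnblogs.com']
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        # response supports css and xpath selectors; e.g. print all link texts
        for text in response.xpath('//a/text()').getall():
            print(text)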
