Question

I want to scrape all of the data on this site.

This part of my script clicks the "Search" button needed to generate the rows of data I want to scrape:

 from selenium import webdriver
import os
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
import sys
import re
import requests

# options = Options()
# options.add_argument("--headless")

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
base_url = 'https://drugdesign.riken.jp/hERGdb/'
driver.get(base_url)

# click the button that says Search
driver.find_element_by_css_selector('[name=Structure_Search]').click()
 

Then I need to click each LOT_ID, which takes me to a page like this one, which I can scrape with this code:

 base_url = 'https://drugdesign.riken.jp/hERGdb/compound.php?HGID=HG-0260086'
driver.get(base_url)

## compound information table
hgid = driver.find_element_by_xpath('//tbody/tr/th[contains(.,"HGID")]/following::td[1]')
drug_name = driver.find_element_by_xpath('//tbody/tr/th[contains(.,"Drug_name")]/following::td[1]')
MW = driver.find_element_by_xpath('//tbody/tr/th[contains(.,"MW")]/following::td[1]')
Formula = driver.find_element_by_xpath('//tbody/tr/th[contains(.,"Formula")]/following::td[1]')


## ID relation table
id_table = driver.find_elements_by_xpath('/html/body/div[2]/div/div/div[2]/table[2]/tbody')
for x in id_table:
    print(x.text)


## in vitro assay information table
assay_data = driver.find_elements_by_xpath('/html/body/div[2]/div/div/div[2]/table[3]/tbody')
for x in assay_data:
    print(x.text)
 
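As an aside, the four separate XPath lookups above can be collapsed into a single pass that pairs each `th` label with its following `td` value. A browser-free sketch using BeautifulSoup, run on a hypothetical HTML fragment that mimics the compound table's layout (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mirroring the compound information table's
# th/td layout; field values here are made up for illustration.
html = """
<table>
  <tbody>
    <tr><th>HGID</th><td>HG-0260086</td></tr>
    <tr><th>Drug_name</th><td>example-drug</td></tr>
    <tr><th>MW</th><td>123.45</td></tr>
    <tr><th>Formula</th><td>C6H6</td></tr>
  </tbody>
</table>
"""

def parse_compound(page_html):
    """Collect each <th> label and its sibling <td> value into a dict."""
    soup = BeautifulSoup(page_html, "html.parser")
    info = {}
    for th in soup.select("tbody tr th"):
        td = th.find_next_sibling("td")
        if td is not None:
            info[th.get_text(strip=True)] = td.get_text(strip=True)
    return info

print(parse_compound(html))
```

The same function could be fed `driver.page_source` from Selenium, or the body of a plain `requests.get` response.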

I can't work out how to iterate over all of the LOT_IDs on the site (e.g. each page only shows 10, and while there seem to be >300,000 results, only 1,000 are shown). So the final question is: how do I iterate over all >300,000 LOT_IDs that the site says match my search, so that I can run the second part of my code (above) on each individual page?

I've been searching SO and have tried things like:

 #table = driver.find_element_by_css_selector('//*[@id="foo-table"]/tbody/tr[1]/td[3]/a')
#print(table)
 

and similar with XPaths etc., but I get errors like this:

 selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified
  (Session info: chrome=77.0.3865.90)
 
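That `InvalidSelectorException` occurs because an XPath string is being passed to `find_element_by_css_selector`, which only accepts CSS selectors; either call `find_element_by_xpath` with the same string, or translate the XPath into CSS. A sketch of the translation, checked with BeautifulSoup on a hypothetical fragment (the table `id` and layout here are assumptions, not the site's real markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the result table.
html = """
<table id="foo-table">
  <tbody>
    <tr>
      <td>1</td><td>hERG</td>
      <td><a href="./compound.php?HGID=HG-0000001">HG-0000001</a></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS equivalent of the XPath //*[@id="foo-table"]/tbody/tr[1]/td[3]/a
link = soup.select_one('#foo-table > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > a')
print(link["href"])  # ./compound.php?HGID=HG-0000001
```

In Selenium the same CSS string would go to `find_element_by_css_selector`, or the original XPath string to `find_element_by_xpath`.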

So if someone could fill in the middle part of my code (I think it should only be a line or two at most?) showing how to loop over the >300,000 LOT_IDs and click them, so that I'm taken to the pages I then scrape, I'd be very grateful.

  Best answer

You can use requests to get all the links. The code below prints all 1,000 links:

 import requests
from bs4 import BeautifulSoup

base_url = "https://drugdesign.riken.jp/hERGdb"    
data = [
  ('smiles_S', ''),
  ('jme_S', ''),
  ('tab_selected', 'tab_S'),
  ('query_type', 'Substructure'),
  ('Target[]', 'hERG'),
  ('Target[]', 'Cav1.2'),
  ('Target[]', 'Nav1.5'),
  ('Target[]', 'Kv1.5'),
  ('Value_type[]', 'IC50'),
  ('Value_type[]', 'inhibition'),
  ('Value_type[]', 'other'),
  ('Assay_type[]', 'binding'),
  ('Assay_type[]', 'patch clamp'),
  ('Assay_type[]', 'other'),
  ('Data_source[]', 'ChEMBL'),
  ('Data_source[]', 'PubChem_CID'),
  ('Data_source[]', 'hERG Central(PubChem_SID)'),
  ('low_MW', ''),
  ('high_MW', ''),
  ('Assay_name', ''),
  ('Structure_Search', 'Search'),
]

response = requests.post(f'{base_url}/result.php', data=data)
lots = BeautifulSoup(response.text, "html.parser").select("a[href^='./compound.php?HGID=']")
for lot in lots:
    url = str(lot['href']).replace("./", "")
    print(f"{base_url}/{url}")
 
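One detail worth noting: `data` is built as a list of tuples rather than a dict because form keys such as `Target[]` repeat, and a dict would keep only one value per key. requests encodes a list of tuples so that every repeated key survives, as the stdlib's own encoder shows on a trimmed-down subset of the form fields:

```python
from urllib.parse import urlencode

# A subset of the form fields from the answer above; repeated keys
# like Target[] must all appear in the encoded body.
data = [
    ('Target[]', 'hERG'),
    ('Target[]', 'Cav1.2'),
    ('Value_type[]', 'IC50'),
]
print(urlencode(data))  # Target%5B%5D=hERG&Target%5B%5D=Cav1.2&Value_type%5B%5D=IC50
```

A dict version of the same payload would silently drop all but one `Target[]` entry, which would change the search results.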
