Selenium Web Scraping Code Explanation | Generated by AI


Explanation of Selenium API Usage in the Provided Code

This Python script uses the Selenium WebDriver library to automate a web browser (specifically Chrome) for interacting with a website (likely a data/trade portal at https://xxdata.com/index.aspx). The goal appears to be web scraping: logging in, searching for trade/order data (using an HS code like ‘8449’), navigating through paginated results, clicking on details, extracting structured data from HTML tables, and storing it in a SQLite database (data.db).

Selenium is used here for browser automation, allowing the script to mimic human interactions like typing, clicking, scrolling, and waiting for page loads—something that’s hard or impossible with simple HTTP requests (e.g., due to JavaScript-heavy pages, iframes, or dynamic content). It handles dynamic web elements, sessions, and multi-window/iframe switching, which are common in modern web apps.

I’ll break it down section by section, highlighting the key Selenium APIs, their purpose, and how they’re used. Note: some methods in the code (e.g., find_element_by_css_selector) come from older Selenium versions (pre-4.0); they were deprecated in Selenium 4.0 and have since been removed in later 4.x releases. In modern Selenium (4+), you’d use find_element(By.CSS_SELECTOR, ...) instead; the functionality is the same. The script also imports the modules needed for waits, exceptions, and element handling.
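
For orientation, here is a hedged sketch of how the deprecated locator calls used in this script translate to the Selenium 4 API (it assumes the browser instance created in run() below):

    from selenium.webdriver.common.by import By

    # Assuming `browser` is the WebDriver created in run(); each deprecated call
    # maps one-to-one onto the unified find_element/find_elements API:
    browser.find_element(By.CSS_SELECTOR, 'input[name=username]')   # was find_element_by_css_selector
    browser.find_element(By.ID, '_easyui_textbox_input7')           # was find_element_by_id
    browser.find_element(By.TAG_NAME, 'html')                       # was find_element_by_tag_name
    browser.find_elements(By.CSS_SELECTOR, 'div.rownumber-bt')      # was find_elements_by_css_selector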

1. Imports and Setup (Selenium Initialization)

   from selenium import webdriver
   from selenium.webdriver.chrome.webdriver import WebDriver
   from selenium.webdriver.common.keys import Keys
   from selenium.common.exceptions import TimeoutException, ElementClickInterceptedException, StaleElementReferenceException
   from selenium.webdriver.support.ui import WebDriverWait
   from selenium.webdriver.support import expected_conditions as EC
   from selenium.webdriver.common.by import By
   from selenium.webdriver.remote.webelement import WebElement

In run() function:

   options = webdriver.ChromeOptions()
   options.add_argument("--start-maximized")  # Opens browser in full screen.
   options.add_argument('--log-level=3')      # Suppresses console logs for cleaner output.
   browser: WebDriver = webdriver.Chrome(executable_path="./chromedriver", options=options)
   browser.get('https://xxdata.com/index.aspx')
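
As an aside, executable_path was removed from the webdriver.Chrome constructor in Selenium 4; a minimal sketch of the equivalent setup, keeping the same options and driver path:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    options.add_argument('--log-level=3')
    # Selenium 4 takes the chromedriver path via a Service object instead of executable_path.
    browser = webdriver.Chrome(service=Service("./chromedriver"), options=options)
    browser.get('https://xxdata.com/index.aspx')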

2. Login Process

   input_username = browser.find_element_by_css_selector('input[name=username]')
   input_username.send_keys('name')
   input_password = browser.find_element_by_css_selector('input[name=password]')
   input_password.send_keys('password')
   btn_login = browser.find_element_by_css_selector('div.login-check')
   btn_login.click()

After login:

   wait_element(browser, 'div.dsh_01')
   trade_div = browser.find_element_by_css_selector('div.dsh_01')
   trade_div.click()
   wait_element(browser, 'a.teq_icon')
   teq = browser.find_element_by_css_selector('a.teq_icon')
   teq.click()
   wait_element(browser, 'div.panel-body')
   iframe = browser.find_element_by_css_selector('div.panel-body > iframe')
   iframe_id = iframe.get_attribute('id')
   browser.switch_to.frame(iframe_id)
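
The wait-then-find-then-switch sequence for the iframe can also be collapsed into one explicit wait; a hedged sketch using the built-in frame_to_be_available_and_switch_to_it condition with the same selector:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Waits up to 30 seconds for the iframe to appear, then switches into it in one step.
    WebDriverWait(browser, 30).until(
        EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, 'div.panel-body > iframe'))
    )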

3. Search Process

   input_search = browser.find_element_by_id('_easyui_textbox_input7')  # Uses ID locator.
   input_search.send_keys('8449')
   time.sleep(10)
   enter = browser.find_element_by_css_selector('a#btnOk > div.enter-bt')
   enter.click()
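
The fixed time.sleep(10) could be swapped for an explicit wait that proceeds as soon as the search button is actually clickable; a minimal sketch with the same locators:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    input_search = browser.find_element(By.ID, '_easyui_textbox_input7')
    input_search.send_keys('8449')
    # Wait until the confirm button is clickable rather than sleeping a fixed 10 seconds.
    enter = WebDriverWait(browser, 30).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a#btnOk > div.enter-bt'))
    )
    enter.click()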

4. Pagination and Result Processing

   result_count_span = browser.find_element_by_css_selector('span#ResultCount')
   page = math.ceil(int(result_count_span.text) / 20)  # Calculates total pages (20 results/page).
   skip = 0
   page = page - skip

   for p in range(page):
       input_page = browser.find_element_by_css_selector('input.laypage_skip')
       input_page.send_keys(str(p + skip + 1))
       btn_confirm = browser.find_element_by_css_selector('button.laypage_btn')
       btn_confirm.click()
       time.sleep(2)

       locates = browser.find_elements_by_css_selector('div.rownumber-bt')  # Multiple elements.
       print('page ' + str(p) + ' size: ' + str(len(locates)))
       for locate in locates:
           browser.execute_script("arguments[0].scrollIntoView();", locate)  # JavaScript scroll.
           time.sleep(1)
           browser.find_element_by_tag_name('html').send_keys(Keys.PAGE_UP)  # Keyboard scroll.
           time.sleep(1)
           try:
               locate.click()
           except ElementClickInterceptedException:
               print('ElementClickInterceptedException')
               continue
           except StaleElementReferenceException:
               print('StaleElementReferenceException')
               continue
           # ... (more below)
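
As an aside, the two click exceptions could also be absorbed by a small helper that falls back to a JavaScript click; a hedged sketch (safe_click is illustrative and not part of the original script):

    from selenium.common.exceptions import (
        ElementClickInterceptedException,
        StaleElementReferenceException,
    )

    def safe_click(browser, element):
        """Try a normal click; fall back to a JavaScript click if it is blocked."""
        try:
            element.click()
            return True
        except ElementClickInterceptedException:
            # Something overlaps the target element; clicking via JavaScript bypasses it.
            browser.execute_script("arguments[0].click();", element)
            return True
        except StaleElementReferenceException:
            # The page re-rendered and the reference is dead; the caller should re-locate it.
            return False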

5. Window/Iframe Switching and Data Extraction

Continuing from the loop:

   time.sleep(1)
   browser.switch_to.window(browser.window_handles[1])  # Switch to new tab/window.
   wait_element(browser, 'div#content')
   try:
       save_page(browser)
   except IndexError:
       print('IndexError')
       continue
   browser.close()  # Closes the detail window.
   browser.switch_to.window(browser.window_handles[0])  # Back to main window.
   browser.switch_to.frame(iframe_id)  # Back to iframe context.
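
If the detail tab is slow to open, window_handles may not yet contain a second handle when the switch runs; a hedged sketch that waits for the new window before switching and then restores the original context:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    main_window = browser.current_window_handle
    # Wait until the detail tab has actually opened before trying to switch to it.
    WebDriverWait(browser, 10).until(EC.number_of_windows_to_be(2))
    detail_window = [h for h in browser.window_handles if h != main_window][0]
    browser.switch_to.window(detail_window)
    # ... scrape the detail page, then close it and return to the main window and iframe.
    browser.close()
    browser.switch_to.window(main_window)
    browser.switch_to.frame(iframe_id)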

6. Data Extraction in save_page(browser: WebDriver) Function

This is the core scraping logic:

   ts = browser.find_elements_by_css_selector('table')  # All tables on the page.
   t0 = ts[0]
   tds0 = t0.find_elements_by_tag_name('td')  # TD cells in first table.
   order_number = tds0[2].text  # Extracts text from specific cells.
   # ... (similar for other tables: t1, t2, etc.)

Conditional logic extracts fields like order_number, importer, exporter, etc., based on table indices—assuming a fixed layout.
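
Index-based extraction like this is brittle if the page layout changes, but the idea is straightforward; a minimal sketch of the same pattern that reads the cells and writes one row into the data.db SQLite file mentioned earlier (the orders table and its single column are assumptions, since the real schema isn't shown):

    import sqlite3
    from selenium.webdriver.common.by import By

    tables = browser.find_elements(By.CSS_SELECTOR, 'table')
    cells = [td.text.strip() for td in tables[0].find_elements(By.TAG_NAME, 'td')]
    order_number = cells[2]  # same fixed cell index the original relies on

    # Hypothetical schema; the actual data.db layout is not shown in the source code.
    conn = sqlite3.connect('data.db')
    conn.execute('CREATE TABLE IF NOT EXISTS orders (order_number TEXT)')
    conn.execute('INSERT INTO orders (order_number) VALUES (?)', (order_number,))
    conn.commit()
    conn.close()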

7. Waits and Error Handling (wait_element Function)

   def wait_element(browser, css):
       timeout = 30
       try:
           element_present = EC.presence_of_element_located((By.CSS_SELECTOR, css))
           WebDriverWait(browser, timeout).until(element_present)
       except TimeoutException:
           print('Timed out waiting for page to load')
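
A small variant of the same helper returns the located element so callers can skip the follow-up find_element call; a sketch under the same 30-second timeout:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException

    def wait_element(browser, css, timeout=30):
        """Return the element once it is present in the DOM, or None on timeout."""
        try:
            return WebDriverWait(browser, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, css))
            )
        except TimeoutException:
            print('Timed out waiting for page to load')
            return None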

8. Cleanup

   time.sleep(1000)  # Long pause (debugging?).
   browser.quit()    # Closes browser and ends session.
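
The 1000-second sleep reads like a leftover debugging pause; a hedged sketch of a cleanup that guarantees quit() runs even if the scraping logic raises (run_scrape is a hypothetical name for the logic above):

    try:
        run_scrape(browser)  # hypothetical wrapper around the login/search/pagination logic
    finally:
        browser.quit()       # always release the browser session and the chromedriver process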

Overall How Selenium Fits In

Selenium drives a real Chrome instance through the entire workflow: it logs in, switches into the iframe that hosts the search UI, submits the HS-code query, pages through the results, opens each detail record in a new window, and hands the rendered HTML tables to save_page for extraction into the SQLite database. Explicit waits (WebDriverWait with expected conditions) and the scattered time.sleep calls are what keep the script in step with the site's JavaScript-driven page loads.

If you have questions about specific parts, modernizing the code, or debugging, let me know!

