Manual Workflow

A step-by-step guide to using the Manual Scraper in MrScraper

The Manual Scraper in MrScraper allows you to create custom scraping workflows by defining a series of steps to extract data from web pages.

Tip

This mode is ideal for users who need full control over the scraping process and want to handle complex scenarios.

When to Use Manual Workflow

Use the Manual Scraper when you need to:

  • Extract data from complex or dynamic websites that require specific interactions.
  • Implement custom workflows that involve multiple steps, such as navigation, data extraction, and pagination.
  • Handle situations where our markdown converter cleans or alters the HTML, causing the AI to fail when parsing it into JSON.

Manual Scraper Features

Below are the available step types you can add when building a manual scraper:

| Step Type | Description |
| --- | --- |
| Extract | Scrapes data from the webpage by setting an Extraction Name, choosing an Extraction Type (Text, Inner HTML, Outer HTML, or Attribute), and defining CSS selectors to target elements. |
| Click | Simulates a click action on a specified element. |
| Delay | Pauses the scraper for a set duration (in milliseconds) before moving to the next step. |
| Input | Enters text into input fields on the webpage. |
| Scroll | Scrolls to the end of the page, until specific text is found, or to a specific element (or until a certain number of elements is present). |
| Inject JavaScript | Runs custom JavaScript code on the webpage. |
| Paginate | Automatically navigates through multiple pages. |
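For intuition, an Extract step with the Text type behaves much like querying the page with a CSS selector. The snippet below is a plain-JavaScript approximation, not MrScraper's actual implementation; the extraction name (price) and the selector (.product-price) are placeholders:

// Rough equivalent of an Extract step with the Text type (illustration only).
// "price" is the Extraction Name; ".product-price" is a placeholder selector.
const extracted = {
  price: Array.from(document.querySelectorAll('.product-price'))
    .map(el => el.textContent.trim())
};
console.log(extracted); // e.g. { price: ["$165.99"] }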

Pagination Types

| Pagination Type | Required Parameter |
| --- | --- |
| Query Pagination | Query parameter (e.g., ?page=2) |
| Directory Pagination | None (e.g., /list/page/2) |
| Next Page Link | Next page selector |
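For a quick illustration of how the first two types resolve page URLs (example.com is a placeholder):

// Query Pagination: the page index lives in a query parameter.
const queryPage = n => `https://example.com/list?page=${n}`;     // /list?page=2

// Directory Pagination: the page index is a path segment, so no extra
// parameter is needed.
const directoryPage = n => `https://example.com/list/page/${n}`; // /list/page/2

// Next Page Link: rather than building URLs, the scraper clicks the element
// matched by the next-page selector (e.g. a.next) until it is gone or disabled.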

Usage Example

Example 1: Simple Website

Suppose you want to scrape data from this product page:

https://www.target.com/p/beats-studio-pro-bluetooth-wireless-headphones/-/A-89459966

On this site, the price is loaded dynamically through an internal API, so the AI Scraper won’t be able to extract it automatically. To capture this data, you’ll need to switch to the Manual Workflow and inject a small custom JavaScript snippet.
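Before building the workflow, you can confirm the data is actually there by opening the browser DevTools console on the product page and checking for the preloaded object (the __TGT_DATA__ name is specific to this site):

// Logs the preloaded query cache if present, otherwise undefined.
console.log(window.__TGT_DATA__?.__PRELOADED_QUERIES__?.queries);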

Follow the steps below:

  1. Create a scraper using the product URL above.
  2. After the AI Scraper completes the initial extraction, open the Manual Workflow tab at the top.
  3. Add an Inject JavaScript step.
  4. Fill the Name field with data and set the Script timeout to 50.
  5. Use this script:
Script
(() => { return window.__TGT_DATA__.__PRELOADED_QUERIES__.queries})();

Tip

You can locate the price by inspecting the page’s source code.
For this specific page, the price is available under the __TGT_DATA__ object in the window.

Note: The data structure varies by website, so the location of price information may differ on other pages.
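If you only need the price rather than the whole query cache, a small variation of the script can search the returned structure for the first price key. This is a sketch that assumes the price object sits somewhere under queries (true for this page, but not guaranteed elsewhere):

(() => {
  // Depth-first search for the first object carrying a "price" key.
  // Assumes the cache is plain, acyclic JSON data.
  const findPrice = (node) => {
    if (node === null || typeof node !== 'object') return null;
    if (node.price !== undefined) return node.price;
    for (const value of Object.values(node)) {
      const hit = findPrice(value);
      if (hit) return hit;
    }
    return null;
  };
  return findPrice(window.__TGT_DATA__.__PRELOADED_QUERIES__.queries);
})();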

Don't know how to write your own script?

Need help writing a custom script for your manual workflow? Contact Us.

  6. Save the configuration, then run the scraper.
  7. The scraper returns the following output:

Result:

Manual workflow result
{
  "data": {
    "...",
    "product": {
      "...",
      "price": {
        "formatted_comparison_price": "$349.99",
        "formatted_comparison_price_type": "reg",
        "formatted_current_price": "$165.99 - $169.99",
        "formatted_current_price_type": "sale",
        "location_id": 3991,
        "current_retail_min": 165.99,
        "reg_retail_max": 349.99
      },
      "..."
    },
    "..."
  }
}
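Once the run completes, the price fields can be read straight off the parsed output. For example (the object below is a minimal stand-in for the result shown above):

// Minimal stand-in for the scraper output above:
const result = {
  data: {
    product: {
      price: {
        formatted_current_price: "$165.99 - $169.99",
        current_retail_min: 165.99
      }
    }
  }
};
console.log(result.data.product.price.formatted_current_price); // "$165.99 - $169.99"
console.log(result.data.product.price.current_retail_min);      // 165.99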

Example 2: Complex Website

Some websites cannot be scraped with the AI Scraper alone, especially when you need to fill in a form before data appears. One example is:

https://www.handelsregister.de/rp_web/normalesuche/welcome.xhtml

Since the site requires entering search details before showing company information, you’ll need to use a Manual Workflow to automate the steps.

Follow the steps below:

  1. Create a scraper using the URL above.
  2. After the AI Scraper completes the initial extraction, open the Manual Workflow tab at the top.
  3. Add an Input Step, fill the Input Field Selector with textarea[title="Company or keywords:"] and the Text Input with the company you want to search for; in this example we'll use Volkswagen & Audi Club.
  4. Add a Delay Step with a 1000 ms duration.
  5. Add a Click Step and select button[name="form:btnSuche"] as the Element to Click.
  6. Add another Delay Step with a 10000 ms duration.
  7. Add an Inject JavaScript Step, fill the Name field with data, set the Script timeout to 900, and use this script:

Don't know how to write your own script?

Need help writing a custom script for your manual workflow? Contact Us.

Script
(async () => {
  
  // Wait for correct URL with timeout
  const targetUrl = "https://www.handelsregister.de/rp_web/sucheErgebnisse/welcome.xhtml?cid=1";
  const maxAttempts = 15;
  const checkInterval = 2000; // 2 seconds
  
  let attempts = 0;
  let urlMatches = false;
  
  console.log('Waiting for correct URL...');
  
  while (attempts < maxAttempts) {
    if (window.location.href === targetUrl) {
      urlMatches = true;
      console.log('URL matched! Starting to parse...');
      break;
    }
    
    attempts++;
    console.log(`Attempt ${attempts}/${maxAttempts}: Current URL does not match. Waiting...`);
    await new Promise(resolve => setTimeout(resolve, checkInterval));
  }
  
  if (!urlMatches) {
    console.log('Timeout: URL never matched the target URL');
    return null;
  }
  
  // Helper function to create unique key for an entry
  function getEntryKey(entry) {
    return `${entry.region}|${entry.court}|${entry.companyName}|${entry.location}`;
  }
  
  // Helper function to check if page has changed
  function hasPageChanged(currentResults, lastPageKeys) {
    if (currentResults.length === 0) return false;
    
    const currentKeys = currentResults.map(getEntryKey);
    // Check if at least one entry is different
    return currentKeys.some(key => !lastPageKeys.has(key));
  }
  
  // Parsing functions
  function parseCurrentPage() {
    const table = document.getElementById('ergebnissForm:selectedSuchErgebnisFormTable_data');
    if (!table) {
      console.log('Table not found on page');
      return [];
    }
    
    const rows = table.querySelectorAll('tr[data-ri]');
    const results = [];
    
    rows.forEach(row => {
      const entry = {};
      
      // Extract region and court info
      const headerCell = row.querySelector('.fontTableNameSize');
      if (headerCell) {
        const headerText = headerCell.textContent.trim();
        const parts = headerText.split(/\s{2,}/);
        entry.region = parts[0]?.trim() || '';
        entry.court = parts[1]?.trim() || '';
      }
      
      // Extract company name
      const nameCell = row.querySelector('.marginLeft20');
      if (nameCell) {
        entry.companyName = nameCell.textContent.trim();
      }
      
      // Extract location (Sitz)
      const locationCell = row.querySelector('.sitzSuchErgebnisse .verticalText');
      if (locationCell) {
        entry.location = locationCell.textContent.trim();
      }
      
      // Extract registration status
      const statusCells = row.querySelectorAll('.verticalText');
      if (statusCells.length > 1) {
        entry.status = statusCells[1].textContent.trim();
      }
      
      // Extract document types (AD, CD, HD, etc.)
      const docLinks = row.querySelectorAll('.dokumentList .underlinedText');
      entry.documentTypes = Array.from(docLinks).map(link => link.textContent.trim());
      
      // Extract history entries
      const historyRows = row.querySelectorAll('.RegPortErg_HistorieZn');
      if (historyRows.length > 0) {
        entry.history = [];
        historyRows.forEach(histRow => {
          const histText = histRow.querySelector('.fontSize85')?.textContent.trim();
          const histLocation = histRow.closest('tr')?.querySelector('.RegPortErg_SitzStatus .fontSize85')?.textContent.trim();
          
          if (histText) {
            entry.history.push({
              name: histText,
              location: histLocation || ''
            });
          }
        });
      }
      
      results.push(entry);
    });
    
    return results;
  }
  
  function isNextButtonDisabled() {
    const nextButton = document.querySelector('a.ui-paginator-next');
    return nextButton && nextButton.classList.contains('ui-state-disabled');
  }
  
  const allResults = [];
  const seenKeys = new Set();
  let pageNumber = 1;
  const delay = 1000;
  const maxRetries = 3; // Max retries if page hasn't changed
  
  console.log(`Parsing page ${pageNumber}...`);
  
  // Parse first page
  const firstPageResults = parseCurrentPage();
  firstPageResults.forEach(entry => {
    const key = getEntryKey(entry);
    if (!seenKeys.has(key)) {
      seenKeys.add(key);
      allResults.push(entry);
    }
  });
  console.log(`Page ${pageNumber}: Found ${firstPageResults.length} entries (${allResults.length} unique so far)`);
  
  // Continue clicking next until button is disabled
  while (!isNextButtonDisabled()) {
    const nextButton = document.querySelector('a.ui-paginator-next');
    
    if (!nextButton) {
      console.log('Next button not found');
      break;
    }
    
    // Store current page keys for comparison
    const lastPageKeys = new Set(seenKeys);
    
    // Click next button
    nextButton.click();
    pageNumber++;
    
    // Wait for page to load and retry if needed
    let retries = 0;
    let pageChanged = false;
    
    while (retries < maxRetries) {
      await new Promise(resolve => setTimeout(resolve, delay));
      
      const currentPageResults = parseCurrentPage();
      pageChanged = hasPageChanged(currentPageResults, lastPageKeys);
      
      if (pageChanged) {
        console.log(`Parsing page ${pageNumber}...`);
        
        // Add only unique entries
        let newEntries = 0;
        currentPageResults.forEach(entry => {
          const key = getEntryKey(entry);
          if (!seenKeys.has(key)) {
            seenKeys.add(key);
            allResults.push(entry);
            newEntries++;
          }
        });
        
        console.log(`Page ${pageNumber}: Found ${currentPageResults.length} entries (${newEntries} new, ${allResults.length} unique total)`);
        break;
      } else {
        retries++;
        if (retries < maxRetries) {
          console.log(`Page ${pageNumber}: Data not updated yet, retrying (${retries}/${maxRetries})...`);
        }
      }
    }
    
    if (!pageChanged) {
      console.log(`Page ${pageNumber}: Data still not updated after ${maxRetries} retries, skipping...`);
    }
  }
  
  console.log(`\nCompleted! Total unique entries: ${allResults.length}`);
  return allResults;
  
})()
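In outline, the script waits (with bounded retries) for the results URL to load, parses the visible table, then clicks the paginator's Next button and re-parses until the button is disabled. Because the site refreshes the table asynchronously, it verifies after each click that the page content actually changed and deduplicates entries with a seen-key set rather than relying on fixed delays alone.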
  8. Save the configuration, then run the scraper.
  9. The scraper returns the following output:

Result:

Manual workflow result
{
  "data": [
    {
      "region": "Bavaria",
      "court": "District court München VR 201131",
      "companyName": "1.Volkswagen & Audi Club Mittenwald e.V.",
      "location": "Mittenwald",
      "status": "currently registered",
      "documentTypes": [
        "AD",
        "CD",
        "DK",
        "UT",
        "VÖ",
        "SI"
      ]
    },
    {
      "region": "Baden-Württemberg",
      "court": "District court Stuttgart VR 381348",
      "companyName": "VW-Audi Club Härten e.V.",
      "location": "Kusterdingen",
      "status": "currently registered",
      "documentTypes": [
        "AD",
        "CD",
        "HD",
        "DK",
        "UT",
        "VÖ",
        "SI"
      ],
      "history": [
        {
          "name": "1.) VW-Audi Club Härten",
          "location": "1.) Kusterdingen"
        }
      ]
    }
  ]
}