Manual Workflow
Step-by-step guide to using the Manual Scraper in MrScraper
The Manual Scraper in MrScraper allows you to create custom scraping workflows by defining a series of steps to extract data from web pages.
Tip
This mode is ideal for users who need full control over the scraping process and want to handle complex scenarios.
When to Use Manual Workflow
Use the Manual Scraper when you need to:
- Extract data from complex or dynamic websites that require specific interactions.
- Implement custom workflows that involve multiple steps, such as navigation, data extraction, and pagination.
- Handle situations where our markdown converter cleans or alters the HTML, causing the AI to fail when parsing it into JSON.
Manual Scraper Features
Below are the available step types you can add when building a manual scraper:
| Step Type | Description |
|---|---|
| Extract | Scrapes data from the webpage by setting an Extraction Name, choosing an Extraction Type (Text, Inner HTML, Outer HTML, or Attribute), and defining CSS selectors to target elements. |
| Click | Simulates a click action on a specified element. |
| Delay | Pauses the scraper for a set duration (in milliseconds) before moving to the next step. |
| Input | Enters text into input fields on the webpage. |
| Scroll | Scrolls to the end of the page, until specific text is found, or to a specific element (or until a given number of elements have loaded). |
| Inject JavaScript | Runs custom JavaScript code on the webpage. |
| Paginate | Automatically navigates through multiple pages. |
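The Inject JavaScript step is the most flexible of these. As the examples below suggest, a script is typically wrapped in an IIFE and returns the JSON-serializable value you want the step to capture. A minimal sketch (the `h1` selector is illustrative only):

```js
// Minimal Inject JavaScript sketch: wrap the logic in an IIFE and
// return a JSON-serializable value. The `h1` selector is illustrative.
(() => {
  const title = document.querySelector('h1')?.textContent?.trim();
  return { title };
})();
```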
Pagination Types
| Pagination Type | Required Parameter |
|---|---|
| Query Pagination | Query parameter (e.g., ?page=2) |
| Directory Pagination | None (e.g., /list/page/2) |
| Next Page Link | Next page selector |
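To make the distinction concrete, here is an illustrative sketch of how each type advances through pages (the URLs are made up; MrScraper handles the navigation once the parameter is configured):

```js
// Illustrative only: how each pagination type advances to page n.
const base = 'https://example.com/list';

// Query Pagination: the page number rides in a query parameter (?page=2).
const queryPage = n => `${base}?page=${n}`;

// Directory Pagination: the page number is a path segment (/list/page/2).
const directoryPage = n => `${base}/page/${n}`;

console.log(queryPage(2));     // https://example.com/list?page=2
console.log(directoryPage(2)); // https://example.com/list/page/2

// Next Page Link has no URL pattern: the scraper repeatedly clicks the
// element matched by your next page selector, e.g. 'a.ui-paginator-next'.
```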
Usage Examples
Example 1: Simple Website
Suppose you want to scrape data from this product page:
https://www.target.com/p/beats-studio-pro-bluetooth-wireless-headphones/-/A-89459966
On this site, the price is loaded dynamically through an internal API, so the AI Scraper won’t be able to extract it automatically. To capture this data, you’ll need to switch to the Manual Workflow and inject a small custom JavaScript snippet.
Follow the steps below:
- Create a scraper using the product URL above.
- After the AI Scraper completes the initial extraction, open the Manual Workflow tab at the top.
- Add an Inject JavaScript step.
- Fill the Name field with `data`, and Script timeout with `50`.
- Use this script:

```js
(() => { return window.__TGT_DATA__.__PRELOADED_QUERIES__.queries })();
```

Tip
You can locate the price by inspecting the page’s source code.
For this specific page, the price is available under the __TGT_DATA__ object in the window.
Note: The data structure varies by website, so the location of price information may differ on other pages.
Don't know how to write your own script?
Contact Us and we'll help you build a custom script for your manual workflow.
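If you'd rather hunt for the data yourself, one generic approach is to search the window object from the DevTools console. A minimal sketch (the key name "price" and the depth limit are assumptions; adapt them per site):

```js
// DevTools console sketch: recursively list dotted paths whose key name
// contains "price". Depth-limited so the traversal stays bounded even if
// the object graph contains cycles.
function findKeys(obj, needle, path = '', depth = 0, out = []) {
  if (obj === null || typeof obj !== 'object' || depth > 6) return out;
  for (const [key, value] of Object.entries(obj)) {
    const next = path ? `${path}.${key}` : key;
    if (key.toLowerCase().includes(needle)) out.push(next);
    findKeys(value, needle, next, depth + 1, out);
  }
  return out;
}

findKeys(window.__TGT_DATA__, 'price'); // dotted paths containing "price"
```

Run it in the console on the product page; each returned path is a candidate location for the price data.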
- Save the configuration, then run the scraper.
- The scraper returns the following output:

Result:

```json
{
  "data": {
    "...",
    "product": {
      "...",
      "price": {
        "formatted_comparison_price": "$349.99",
        "formatted_comparison_price_type": "reg",
        "formatted_current_price": "$165.99 - $169.99",
        "formatted_current_price_type": "sale",
        "location_id": 3991,
        "current_retail_min": 165.99,
        "reg_retail_max": 349.99
      },
      "..."
    },
    "..."
  }
}
```
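If you post-process the scraper output yourself, the price fields can be read off the path shown above. A minimal sketch, assuming the output has been parsed into a hypothetical `result` object:

```js
// Post-processing sketch: pull the price fields out of the output above.
// `result` is assumed to hold the parsed scraper output; optional chaining
// guards against the structure differing on other product pages.
const price = result?.data?.product?.price;
console.log(price?.formatted_current_price); // "$165.99 - $169.99"
console.log(price?.current_retail_min);      // 165.99
```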
Example 2: Complex Website
Some websites cannot be scraped with the AI Scraper alone, especially when you need to fill in a form before data appears. One example is:
https://www.handelsregister.de/rp_web/normalesuche/welcome.xhtml
Since the site requires entering search details before showing company information, you’ll need to use a Manual Workflow to automate the steps.
Follow the steps below:
- Create a scraper using the URL above.
- After the AI Scraper completes the initial extraction, open the Manual Workflow tab at the top.
- Add an Input step. Fill Input Field Selector with `textarea[title="Company or keywords:"]` and Text Input with the company you want to search for; in this example we'll use `Volkswagen & Audi Club`.
- Add a Delay step with a `1000` ms duration.
- Add a Click step and select `button[name="form:btnSuche"]` as the Element to Click.
- Add another Delay step with a `10000` ms duration.
- Add an Inject JavaScript step, fill the Name field with `data` and Script timeout with `900`, then use this script:
```js
(async () => {
// Wait for correct URL with timeout
const targetUrl = "https://www.handelsregister.de/rp_web/sucheErgebnisse/welcome.xhtml?cid=1";
const maxAttempts = 15;
const checkInterval = 2000; // 2 seconds
let attempts = 0;
let urlMatches = false;
console.log('Waiting for correct URL...');
while (attempts < maxAttempts) {
if (window.location.href === targetUrl) {
urlMatches = true;
console.log('URL matched! Starting to parse...');
break;
}
attempts++;
console.log(`Attempt ${attempts}/${maxAttempts}: Current URL does not match. Waiting...`);
await new Promise(resolve => setTimeout(resolve, checkInterval));
}
if (!urlMatches) {
console.log('Timeout: URL never matched the target URL');
return null;
}
// Helper function to create unique key for an entry
function getEntryKey(entry) {
return `${entry.region}|${entry.court}|${entry.companyName}|${entry.location}`;
}
// Helper function to check if page has changed
function hasPageChanged(currentResults, lastPageKeys) {
if (currentResults.length === 0) return false;
const currentKeys = currentResults.map(getEntryKey);
// Check if at least one entry is different
return currentKeys.some(key => !lastPageKeys.has(key));
}
// Parsing functions
function parseCurrentPage() {
const table = document.getElementById('ergebnissForm:selectedSuchErgebnisFormTable_data');
if (!table) {
console.log('Table not found on page');
return [];
}
const rows = table.querySelectorAll('tr[data-ri]');
const results = [];
rows.forEach(row => {
const entry = {};
// Extract region and court info
const headerCell = row.querySelector('.fontTableNameSize');
if (headerCell) {
const headerText = headerCell.textContent.trim();
const parts = headerText.split(/\s{2,}/);
entry.region = parts[0]?.trim() || '';
entry.court = parts[1]?.trim() || '';
}
// Extract company name
const nameCell = row.querySelector('.marginLeft20');
if (nameCell) {
entry.companyName = nameCell.textContent.trim();
}
// Extract location (Sitz)
const locationCell = row.querySelector('.sitzSuchErgebnisse .verticalText');
if (locationCell) {
entry.location = locationCell.textContent.trim();
}
// Extract registration status
const statusCells = row.querySelectorAll('.verticalText');
if (statusCells.length > 1) {
entry.status = statusCells[1].textContent.trim();
}
// Extract document types (AD, CD, HD, etc.)
const docLinks = row.querySelectorAll('.dokumentList .underlinedText');
entry.documentTypes = Array.from(docLinks).map(link => link.textContent.trim());
// Extract history entries
const historyRows = row.querySelectorAll('.RegPortErg_HistorieZn');
if (historyRows.length > 0) {
entry.history = [];
historyRows.forEach(histRow => {
const histText = histRow.querySelector('.fontSize85')?.textContent.trim();
const histLocation = histRow.closest('tr')?.querySelector('.RegPortErg_SitzStatus .fontSize85')?.textContent.trim();
if (histText) {
entry.history.push({
name: histText,
location: histLocation || ''
});
}
});
}
results.push(entry);
});
return results;
}
function isNextButtonDisabled() {
const nextButton = document.querySelector('a.ui-paginator-next');
return nextButton && nextButton.classList.contains('ui-state-disabled');
}
const allResults = [];
const seenKeys = new Set();
let pageNumber = 1;
const delay = 1000;
const maxRetries = 3; // Max retries if page hasn't changed
console.log(`Parsing page ${pageNumber}...`);
// Parse first page
const firstPageResults = parseCurrentPage();
firstPageResults.forEach(entry => {
const key = getEntryKey(entry);
if (!seenKeys.has(key)) {
seenKeys.add(key);
allResults.push(entry);
}
});
console.log(`Page ${pageNumber}: Found ${firstPageResults.length} entries (${allResults.length} unique so far)`);
// Continue clicking next until button is disabled
while (!isNextButtonDisabled()) {
const nextButton = document.querySelector('a.ui-paginator-next');
if (!nextButton) {
console.log('Next button not found');
break;
}
// Store current page keys for comparison
const lastPageKeys = new Set(seenKeys);
// Click next button
nextButton.click();
pageNumber++;
// Wait for page to load and retry if needed
let retries = 0;
let pageChanged = false;
while (retries < maxRetries) {
await new Promise(resolve => setTimeout(resolve, delay));
const currentPageResults = parseCurrentPage();
pageChanged = hasPageChanged(currentPageResults, lastPageKeys);
if (pageChanged) {
console.log(`Parsing page ${pageNumber}...`);
// Add only unique entries
let newEntries = 0;
currentPageResults.forEach(entry => {
const key = getEntryKey(entry);
if (!seenKeys.has(key)) {
seenKeys.add(key);
allResults.push(entry);
newEntries++;
}
});
console.log(`Page ${pageNumber}: Found ${currentPageResults.length} entries (${newEntries} new, ${allResults.length} unique total)`);
break;
} else {
retries++;
if (retries < maxRetries) {
console.log(`Page ${pageNumber}: Data not updated yet, retrying (${retries}/${maxRetries})...`);
}
}
}
if (!pageChanged) {
console.log(`Page ${pageNumber}: Data still not updated after ${maxRetries} retries, skipping...`);
}
}
console.log(`\nCompleted! Total unique entries: ${allResults.length}`);
return allResults;
})();
```

- Save the configuration, then run the scraper.
- The scraper returns the following output:

Result:

```json
{
"data": [
{
"region": "Bavaria",
"court": "District court München VR 201131",
"companyName": "1.Volkswagen & Audi Club Mittenwald e.V.",
"location": "Mittenwald",
"status": "currently registered",
"documentTypes": [
"AD",
"CD",
"DK",
"UT",
"VÖ",
"SI"
]
},
{
"region": "Baden-Württemberg",
"court": "District court Stuttgart VR 381348",
"companyName": "VW-Audi Club Härten e.V.",
"location": "Kusterdingen",
"status": "currently registered",
"documentTypes": [
"AD",
"CD",
"HD",
"DK",
"UT",
"VÖ",
"SI"
],
"history": [
{
"name": "1.) VW-Audi Club Härten",
"location": "1.) Kusterdingen"
}
]
}
]
}
```
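If you need these results in a flat format, the output can be post-processed outside MrScraper. A minimal sketch, assuming the JSON above has been parsed into a hypothetical `result` object (the `history` field is omitted for brevity):

```js
// Post-processing sketch: flatten the output above into CSV rows.
// `result` is assumed to hold the parsed JSON shown above.
function toCsv(entries) {
  const header = ['region', 'court', 'companyName', 'location', 'status', 'documentTypes'];
  const rows = entries.map(e =>
    [e.region, e.court, e.companyName, e.location, e.status, (e.documentTypes ?? []).join(' ')]
      .map(v => `"${String(v ?? '').replace(/"/g, '""')}"`) // quote and escape
      .join(',')
  );
  return [header.join(','), ...rows].join('\n');
}

console.log(toCsv(result.data));
```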