The DAR (Data Archive Request) Schema is a file format that extends HAR (HTTP Archive) with additional objects, such as renders and result. Where a HAR file records the network traffic of a session, a DAR file also records what was rendered and what the scraping run produced, giving a more detailed view of the web scraping and data collection process.
This guide will help you get started with DAR by walking you through the installation, basic usage, and integration of DAR tools into your data workflows.
- Installation
- Setting Up Your Environment
- Creating Your First DAR File
- Parsing and Analyzing DAR Files
- Integrating DAR in Your Scraping Workflow
- Next Steps
Start by cloning the DAR-Schema repository from GitHub:
git clone https://github.com/OpenBrand/DAR-Schema.git
cd DAR-Schema
The DAR tools rely on Python, so ensure you have Python installed on your system. You can check if Python is installed by running:
python --version
If Python is not installed, you can download it from Python's official website.
Next, install the required Python packages:
pip install -r requirements.txt
This command installs the dependencies required to run the DAR conversion and parsing scripts.
After installation, verify that the tools are working correctly by running the help command for the DAR tools:
python tools/har_to_dar_converter.py --help
You should see usage instructions, confirming that the environment is set up correctly.
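Before creating a DAR file, you need a HAR file generated from a web scraping session. Most modern browsers can export network activity as a HAR file from their developer tools. Automation tools like Selenium can also produce one, though this typically requires an extra component such as a recording proxy.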
Here’s a simple example using Chrome DevTools:
- Open Chrome DevTools (F12 or right-click > Inspect).
- Go to the Network tab.
- Perform the web interactions you want to capture.
- Right-click in the Network tab and select Save all as HAR with content.
Save this HAR file in your project directory.
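Before converting, it can be worth confirming that the exported file is a usable HAR. The sketch below uses only the standard library and relies on the HAR layout, where requests are listed under `log.entries` and each entry carries a `request` and a `response` object. The function names `check_har` and `check_har_file` are illustrative and are not part of the DAR tools.

```python
import json


def check_har(har):
    """Return a list of problems found in a parsed HAR document (empty if none)."""
    problems = []
    log = har.get("log") if isinstance(har, dict) else None
    if not isinstance(log, dict):
        return ["missing top-level 'log' object"]
    entries = log.get("entries")
    if not isinstance(entries, list):
        return ["'log.entries' is missing or not a list"]
    if not entries:
        problems.append("'log.entries' is empty: no requests were captured")
    for index, entry in enumerate(entries):
        for key in ("request", "response"):
            # Every HAR entry is expected to describe both sides of the exchange.
            if not isinstance(entry, dict) or key not in entry:
                problems.append(f"entry {index} has no '{key}' object")
    return problems


def check_har_file(path):
    """Load a HAR file (JSON, optionally with a UTF-8 byte order mark) and check it."""
    with open(path, encoding="utf-8-sig") as handle:
        return check_har(json.load(handle))
```

Calling `check_har_file("input.har")` on the file saved from DevTools returns an empty list when none of these basic problems is found.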
Use the har_to_dar_converter.py script to convert your HAR file into a DAR file.
python tools/har_to_dar_converter.py input.har output.dar
This command will create a DAR file (output.dar) from your HAR file (input.har), adding enhanced objects like renders and result.
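When there are several HAR files to convert, the same command can be driven from Python. The following is a minimal sketch that assumes it is run from the repository root and that the converter accepts the two positional arguments shown above. The helper names `build_convert_command` and `convert_all` are hypothetical, and the idea that the converter reports failure through a non-zero exit status is an assumption.

```python
import subprocess
import sys
from pathlib import Path

# Path to the converter, relative to the repository root (as used above).
CONVERTER = Path("tools") / "har_to_dar_converter.py"


def build_convert_command(har_path):
    """Return the argument list that converts one HAR file to a sibling .dar file."""
    har_path = Path(har_path)
    dar_path = har_path.with_suffix(".dar")
    # sys.executable reuses the interpreter that is running this script.
    return [sys.executable, str(CONVERTER), str(har_path), str(dar_path)]


def convert_all(directory):
    """Convert every *.har file in `directory`; return the paths of the DAR files."""
    outputs = []
    for har_path in sorted(Path(directory).glob("*.har")):
        command = build_convert_command(har_path)
        # check=True raises CalledProcessError on a non-zero exit status.
        # ASSUMPTION: the converter signals failure through its exit status.
        subprocess.run(command, check=True)
        outputs.append(Path(command[-1]))
    return outputs
```

Building the command separately from running it keeps the argument handling easy to inspect without invoking the converter.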
Once you have your DAR file, you can use the dar_parser.py tool to read and analyze the data.
Run the dar_parser.py script to parse the DAR file and extract key information.
python tools/dar_parser.py
The script will prompt you for the path to your DAR file and output a summary, including:
- Number of renders captured.
- Summary of scraping results.
- Number of HTTP request entries.
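To make the three summary items concrete, here is a rough sketch of how such counts could be computed by hand. It is not the implementation of dar_parser.py. It assumes a DAR file is JSON like the HAR format it extends, and that `renders` and `result` are stored under `log` beside the HAR `entries` list. That layout is a guess for illustration, and the schema in the repository is the authority.

```python
import json


def load_dar(path):
    """Read a DAR file, assuming it is JSON like the HAR format it extends."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_dar(dar):
    """Count renders and entries and return the result object.

    ASSUMPTION: `renders` and `result` live under `log`, next to `entries`.
    This placement is hypothetical; check the repository's schema.
    """
    log = dar.get("log", {})
    return {
        "render_count": len(log.get("renders", [])),
        "entry_count": len(log.get("entries", [])),
        "result": log.get("result", {}),
    }
```

Missing keys are treated as empty, so a plain HAR file passed to `summarize_dar` reports zero renders and an empty result.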
You can also use the parser from your own Python code:
from tools.dar_parser import DARParser
# Initialize the parser with your DAR file
parser = DARParser('output.dar')
# Print a detailed summary of the DAR file
parser.print_dar_summary()
DAR is especially useful when integrated into web scraping workflows, providing insights into how pages are loaded and data is collected. Here’s how you can incorporate DAR into your existing scraping setup.
Use Selenium or your preferred tool to navigate and interact with a webpage, capturing a HAR file as you go.
from selenium import webdriver
# Set up Selenium to capture network traffic
driver = webdriver.Chrome()
driver.get('https://example.com')
# Capture the session as HAR (specific to your setup)
driver.quit()  # close the browser once the capture is done
After capturing the HAR file, convert it to DAR to enhance the captured data with renders and result data.
python tools/har_to_dar_converter.py session.har session.dar
Use the DAR file to gain insights into your scraping process, identify any errors, and understand how the data was collected.
from tools.dar_parser import DARParser
# Parse and print the summary of your DAR file
parser = DARParser('session.dar')
parser.print_dar_summary()
Now that you’ve created, parsed, and analyzed your first DAR file, you can explore further by:
- Experimenting with additional render states to capture more detailed data.
- Customizing the result object to include specific metrics and error handling relevant to your workflow.
- Integrating DAR files into automated data pipelines for monitoring and auditing purposes.
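As one possible starting point for the monitoring idea, a pipeline step could flag requests that failed during a session. The sketch below assumes that a DAR file keeps the HAR `log.entries` list unchanged, with the standard `request.url` and `response.status` fields. The function name `failed_entries` is illustrative and not part of the DAR tools.

```python
def failed_entries(dar, min_status=400):
    """Return (status, url) pairs for entries that look like failed requests."""
    failures = []
    for entry in dar.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        url = entry.get("request", {}).get("url", "")
        # Browsers typically export status 0 when no response was received,
        # for example when a request was blocked or aborted.
        if status >= min_status or status == 0:
            failures.append((status, url))
    return failures


# Small in-memory example; a real pipeline would load the DAR file from disk.
sample = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"}, "response": {"status": 200}},
            {"request": {"url": "https://example.com/missing"}, "response": {"status": 404}},
        ]
    }
}
print(failed_entries(sample))
```

A scheduled job could run a check like this over each new DAR file and raise an alert when the list is not empty.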
For more advanced usage, check out the Usage Examples document to explore various ways to utilize DAR in your projects.
DAR provides a powerful, flexible way to enhance your data collection processes. By extending HAR with additional render and result data, DAR helps you gain deeper insights and ensure reliable, comprehensive data capture.