⭐ Core SDK Files for Scripts

Required Files (Located in the Project Root Directory)

├── main.py
├── requirements.txt
├── input_schema.json
├── README.md
├── sdk.py
├── sdk_pb2.py
├── sdk_pb2_grpc.py

File Name	Description
`main.py`	Script entry file (execution entry point), uniformly named `main`
`requirements.txt`	Python dependency management file
`input_schema.json`	UI input form configuration file
`README.md`	Project documentation file
`sdk.py`	SDK core functionality
`sdk_pb2.py`	Enhanced data processing module
`sdk_pb2_grpc.py`	Network communication module

⭐ Core SDK Files for Scripts

📁 File Description

The following are three important SDK files that must be placed in the root directory of your script:

File Name	Core Functionality
`sdk.py`	Basic functional modules
`sdk_pb2.py`	Enhanced data processing module
`sdk_pb2_grpc.py`	Network communication module

These three files form the "toolkit" for your script, providing all core functionalities required for interacting with the mid-platform system and running web crawlers.

🔧 Core Function Usage Guide

1. Environment Parameter Retrieval - Get Configuration at Script Startup

When a script starts, you can pass external configuration parameters (e.g., target website URL, search keywords). Use the following method to retrieve these parameters:

# Retrieve all incoming parameters and return them as a dictionary.
parameters = SDK.Parameter.get_input_json_dict()

# Example: Assume a website URL and keywords are provided as input.
# return：{"website": "example.com", "keyword": "news"}

Use Case：For example, if you need to scrape data from different websites for different tasks, you can achieve this by passing different parameters without modifying the code.

2. Run log - record the script execution process

During script execution, you can log information at different levels. These logs will be displayed in the mid-platform interface, facilitating status monitoring and troubleshooting:

SDK.Log.debug("Connecting to the target website...")

SDK.Log.info("Successfully retrieved 10 news items.")

SDK.Log.warn("The network connection is slow, which may affect the data collection speed.")

SDK.Log.error("Unable to access the target website. Please check your network connection.")

Log Level Explanation:

debug: Most detailed debugging information, suitable for development
info: Normal process logging, recommended for key steps
warn: Warning messages indicating potential issues that don't stop execution
error: Error messages indicating critical issues requiring attention

3.Result Return - Send Scraped Data to Mid-Platform

After scraping data, return it to the mid-platform system by following these two steps:

Step 1: Set Table Headers (Mandatory First Step)

Before pushing actual data, define the table structure (similar to setting column headers in Excel):

# Define the table column structure.
headers = [
    {
        "label": "News title",      # column name
        "key": "title",          # column key
        "format": "text",        # Data type：text
    },
    {
        "label": "Publish time",
        "key": "publish_time",  
        "format": "text",
    },
    {
        "label": "News category",
        "key": "category",
        "format": "text",
    },
]

# set header
res = CafeSDK.Result.set_table_header(headers)

Field Explanation:：

label：Column header displayed in the table (user-visible, Chinese recommended)
key：Unique identifier for data (used in code, lowercase English with underscores recommended)
format：Data type supporting the following formats:
- "text"：Text/string
- "integer"：Integer
- "boolean"：Yes/No (Boolean)
- "array"：Array/list
- "object"：Object/dictionary

Step 2: Push Data Entry by Entry

After setting headers, start pushing scraped data:

# News date
news_data = [
    {"title": "New breakthroughs in artificial intelligence", "publish_time": "2023-10-01", "category": "Technology"},
    {"title": "Today's stock market trends", "publish_time": "2023-10-01", "category": "Finance"},
    # ... more
]

# push data
for i, news in enumerate(news_data):
    obj = {
        "title": news.get('title'),          
        "publish_time": news.get('publish_time'), 
        "category": news.get('category'),
    }

    res = CafeSDK.Result.push_data(obj)

    # Record the push results
    SDK.Log.info(f"Data item {i+1}：{news.get('title')}")

Important Reminders:：

The order of setting headers and pushing data can be reversed.
Keys in the pushed data dictionary must exactly match those defined in headers
Data must be pushed entry by entry (bulk push not supported)
It is recommended to log after each push to track execution progress

💡 Complete Code Example

The following is a full script example demonstrating the complete workflow from parameter retrieval to data return:

# 1. Retrieve startup parameters
config = SDK.Parameter.get_input_json_dict()
website = config.get("website", "Default website")
SDK.Log.info(f"Start collecting data from the website：{website}")

# 2. Set the result table header
headers = [
    {"label": "Title", "key": "title", "format": "text"},
    {"label": "Time", "key": "publish_time", "format": "text"},
    {"label": "Category", "key": "category", "format": "text"},
    {"label": "View count", "key": "view_count", "format": "integer"},
]
CafeSDK.Result.set_table_header(headers)

# 3. Simulate data collection (in practice, this could be web scraping code)
collected_data = [
    {"title": "Sample News 1", "publish_time": "2023-10-01 10:00", "category": "Technology", "view_count": 1000},
    {"title": "Sample News 2", "publish_time": "2023-10-01 11:00", "category": "Finance", "view_count": 500},
]

# 4. Push data
for data in collected_data:
    obj = {
        "title": data.get("title"),
        "publish_time": data.get("publish_time"),
        "category": data.get("category"),
        "view_count": data.get("view_count", 0),  # Provide a default value of 0
    }
    res = CafeSDK.Result.push_data(obj)

# 5. Completed
SDK.Log.info("Data collection task completed！")

⚠️ Common Issues & Notes

File Location： Ensure the three SDK files are placed in the script's root directory (folder containing the main file)
Import Method：Directly use SDK or CafeSDK in code to call related functions
键名一致：Keys used for pushing data must exactly match (including case) those defined in headers
错误处理： It is recommended to check return results for each SDK call, especially when pushing data

With the above functionalities, your script can seamlessly integrate with the mid-platform system, enabling flexible parameter configuration, transparent execution monitoring, and standardized return of scraped data.

⭐ Script Entry File (main.py)

💡 Code Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import asyncio
import os

from sdk import CafeSDK

async def run():
    try:
        # 1. Get params
        input_json_dict = CafeSDK.Parameter.get_input_json_dict()
        CafeSDK.Log.debug(f"params: {input_json_dict}")
        
        # 2. proxy configuration
        proxyDomain = "proxy-inner.cafescraper.com:6000"
        
        try:
            proxyAuth = os.environ.get("PROXY_AUTH")
            CafeSDK.Log.info(f"Proxy authentication information: {proxyAuth}")
        except Exception as e:
            CafeSDK.Log.error(f"Failed to retrieve proxy authentication information: {e}")
            proxyAuth = None
        
        # 3. Construct the proxy URL
        proxy_url = f"socks5://{proxyAuth}@{proxyDomain}" if proxyAuth else None
        CafeSDK.Log.info(f"Proxy address: {proxy_url}")
        
        # 4. TODO: Handle business logic
        url = input_json_dict.get('url')
        CafeSDK.Log.info(f"start deal URL: {url}")
        
        # Simulate business processing results
        result = {
            "url": url,
            "status": "success",
            "data": {
                "title": "Sample title",
                "content": "Sample content",
                # ... Other fields
            }
        }
        
        # 5. Push result data
        CafeSDK.Log.info(f"Processing result: {result}")
        CafeSDK.Result.push_data(result)
        
        # 6. Set the table headers (if table output is needed)
        headers = [
            {
                "label": "URL",
                "key": "url",
                "format": "text",
            },
            # ... Other table header configurations
        ]
        res = CafeSDK.Result.set_table_header(headers)
        
        CafeSDK.Log.info("Script execution completed")
        
    except Exception as e:
        CafeSDK.Log.error(f"Script execution error: {e}")
        error_result = {
            "error": str(e),
            "error_code": "500",
            "status": "failed"
        }
        CafeSDK.Result.push_data(error_result)
        raise

if __name__ == "__main__":
    asyncio.run(run())

Automated Data Scraping Script: Operation & Principle Guide

1. Script Overview

This is a template for an automation tool that acts like a "digital employee". It automatically opens specified web pages (e.g., social media pages), extracts the information you need, and organizes it into structured tables.

2. How It Works (4 Core Stages)

Stage 1: Receive Instructions (Retrieve Input Parameters)

Before starting the script, you provide instructions (e.g., which webpage URL to scrape, how many data entries to collect).

Stage 2: Anonymity Setup (Proxy Network Configuration)

To access overseas or restricted websites smoothly, the script automatically configures an "encrypted tunnel" (proxy server).

Stage 3: Automated Work (Business Logic Processing)

This is the core part of the script. Based on the provided URL, it automatically navigates to the target page and extracts information such as titles, content, and image URLs.

Stage 4: Report Results (Data Push & Table Generation)

After scraping, the script converts unstructured information into a standard format and saves it to the system. It also automatically designs table headers (e.g., Column 1: "URL", Column 2: "Content").

⭐Python Dependency Management File (requirements.txt)

This file lists all third-party Python packages and their version information required to run the script. The system will automatically read this file and install all specified dependencies to ensure the script runs correctly.

Dependency List Explanation

Below are the Python packages and their corresponding versions required for this project:

Example：

attrs==25.4.0
beautifulsoup4==4.14.2
certifi==2025.10.5
cffi==2.0.0
charset-normalizer==3.4.4
click==8.3.0
colorama==0.4.6
cssselect==1.3.0
DataRecorder==3.6.2
DownloadKit==2.0.7
DrissionPage==4.1.1.2
et_xmlfile
filelock

❗Important Notes

1. Version Number Explanation

Packages with version numbers（e.g. beautifulsoup4==4.14.2）：The system installs this exact version to ensure consistency with the development environment.
Packages without version numbers（e.g. et_xmlfile）：The system automatically installs the latest available version of the package.

2. Installation Notes

The system automatically handles installation of all dependencies – no manual operation required.
Installation may take several minutes depending on network speed and package size.
If installation fails, the system displays error messages – resolve issues as prompted.

3. Ensure Script Functionality

To ensure smooth execution of the Python script:

List all used third-party packages in this file.
Specify version numbers for critical core packages (e.g., package==version).
Regularly update dependencies to get new features and security fixes.

Frequently Asked Questions (FAQs)

Q：Why specify version numbers?

A：Specifying version numbers ensures the same package versions are used across different environments (development, testing, production), avoiding inconsistent program behavior or compatibility issues caused by version differences.

Q：What happens if no version number is specified?

A：Without a version number, the system installs the latest version of the package. This may cause incompatibility with the script, so it's recommended to fix versions for core dependencies.

Q：How to add new dependencies?

A：Simply add a new line in this file in the format "package==version" or "package", then re-upload the zip compressed file in the backend. The system will automatically install it on next run.

Q：What to do if installation fails?

A： Check network connectivity or try switching Python package mirrors. If issues persist, contact the system administrator.

Last Updated: Ensure this dependency list is updated promptly after modifying script functionality to reflect all new package dependencies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Required Files (Located in the Project Root Directory)

⭐ Core SDK Files for Scripts

📁 File Description

🔧 Core Function Usage Guide

1. Environment Parameter Retrieval - Get Configuration at Script Startup

2. Run log - record the script execution process

Log Level Explanation:

3.Result Return - Send Scraped Data to Mid-Platform

Step 1: Set Table Headers (Mandatory First Step)

Step 2: Push Data Entry by Entry

💡 Complete Code Example

⚠️ Common Issues & Notes

⭐ Script Entry File (main.py)

💡 Code Example

Automated Data Scraping Script: Operation & Principle Guide

1. Script Overview

2. How It Works (4 Core Stages)

Stage 1: Receive Instructions (Retrieve Input Parameters)

Stage 2: Anonymity Setup (Proxy Network Configuration)

Stage 3: Automated Work (Business Logic Processing)

Stage 4: Report Results (Data Push & Table Generation)

⭐Python Dependency Management File (requirements.txt)

Dependency List Explanation

❗Important Notes

1. Version Number Explanation

2. Installation Notes

3. Ensure Script Functionality

Frequently Asked Questions (FAQs)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
sdk.py		sdk.py
sdk_pb2.py		sdk_pb2.py
sdk_pb2_grpc.py		sdk_pb2_grpc.py

Folders and files

Latest commit

History

Repository files navigation

Required Files (Located in the Project Root Directory)

⭐ Core SDK Files for Scripts

📁 File Description

🔧 Core Function Usage Guide

1. Environment Parameter Retrieval - Get Configuration at Script Startup

2. Run log - record the script execution process

Log Level Explanation:

3.Result Return - Send Scraped Data to Mid-Platform

Step 1: Set Table Headers (Mandatory First Step)

Step 2: Push Data Entry by Entry

💡 Complete Code Example

⚠️ Common Issues & Notes

⭐ Script Entry File (main.py)

💡 Code Example

Automated Data Scraping Script: Operation & Principle Guide

1. Script Overview

2. How It Works (4 Core Stages)

Stage 1: Receive Instructions (Retrieve Input Parameters)

Stage 2: Anonymity Setup (Proxy Network Configuration)

Stage 3: Automated Work (Business Logic Processing)

Stage 4: Report Results (Data Push & Table Generation)

⭐Python Dependency Management File (requirements.txt)

Dependency List Explanation

❗Important Notes

1. Version Number Explanation

2. Installation Notes

3. Ensure Script Functionality

Frequently Asked Questions (FAQs)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages