diff --git a/.markdownlint.json b/.markdownlint.json
index 9d78e7f46f..e08d8929bb 100644
--- a/.markdownlint.json
+++ b/.markdownlint.json
@@ -11,6 +11,8 @@
"no-multiple-blanks": {
"maximum": 2
},
+ "table-column-style": false,
+ "ul-indent": false,
"no-space-in-emphasis": false,
"link-fragments": false,
"no-duplicate-heading": {
diff --git a/sources/academy/build-and-publish/actor-ideas/what_software_an_actor_can_be.md b/sources/academy/build-and-publish/actor-ideas/what_software_an_actor_can_be.md
index 2732510c95..6810e37d10 100644
--- a/sources/academy/build-and-publish/actor-ideas/what_software_an_actor_can_be.md
+++ b/sources/academy/build-and-publish/actor-ideas/what_software_an_actor_can_be.md
@@ -150,7 +150,6 @@ Any repetitive job matching the following criteria might be suitable for turning
If you look closely, you'll start seeing opportunities for new Actors everywhere. Be creative!
-
## Use the Actor ideas page
The [Actor ideas](https://apify.com/ideas) page is where you can find inspiration for new Actors sourced from the Apify community.
@@ -166,20 +165,16 @@ Build and publish new tools on Apify and have multiple chances to win big prizes
:::
1. _Visit_ [apify.com/ideas](https://apify.com/ideas) to find ideas that interest you. Look for ideas that align with your skills.
+2. _Select an Actor idea_: Review the details and requirements. Check the status—if it's marked **Open to develop**, you can start building.
+3. _Build your Actor_: Develop your Actor based on the idea. You don't need to notify Apify during development.
+4. _Prepare for launch_: Ensure your Actor meets quality standards and has a comprehensive README with installation instructions, usage details, and examples.
+5. _Publish your Actor_: Deploy your Actor on Apify Store and make it live.
+6. _Monitor and optimize_: Track your Actor's performance and user feedback. Make improvements to keep your Actor current.
-1. _Select an Actor idea_: Review the details and requirements. Check the status—if it's marked **Open to develop**, you can start building.
-
-1. _Build your Actor_: Develop your Actor based on the idea. You don't need to notify Apify during development.
-
-1. _Prepare for launch_: Ensure your Actor meets quality standards and has a comprehensive README with installation instructions, usage details, and examples.
-
-1. _Publish your Actor_: Deploy your Actor on Apify Store and make it live.
-
-
-
-1. _Monitor and optimize_: Track your Actor's performance and user feedback. Make improvements to keep your Actor current.
+
-
+
#### Can the Actor's meta description and description be the same?
Yes, they can, as long as they have the same (shorter) length (under 150 characters). But they can also be different - there's no harm in that.
diff --git a/sources/academy/build-and-publish/apify-store-basics/how_actor_monetization_works.md b/sources/academy/build-and-publish/apify-store-basics/how_actor_monetization_works.md
index a7befdf3e7..907ede7c1d 100644
--- a/sources/academy/build-and-publish/apify-store-basics/how_actor_monetization_works.md
+++ b/sources/academy/build-and-publish/apify-store-basics/how_actor_monetization_works.md
@@ -28,26 +28,26 @@ Monetizing your Actor on the Apify platform involves several key steps:

- _How it works_: you charge users based on specific events triggered programmatically by your Actor's code. You earn 80% of the revenue minus platform usage costs.
-- - _Profit calculation_: `profit = (0.8 * revenue) - platform usage costs`
+- _Profit calculation_: `profit = (0.8 * revenue) - platform usage costs`
- _Event cost example_: you set the following events for your Actor:
- - `Actor start per 1 GB of memory` at $0.005
- - `Pages scraped` at $0.002
- - `Page opened with residential proxy` at $0.002 - this is on top of `Pages scraped`
- - `Page opened with a browser` at $0.002 - this is on top of `Pages scraped`
+ - `Actor start per 1 GB of memory` at $0.005
+ - `Pages scraped` at $0.002
+ - `Page opened with residential proxy` at $0.002 - this is on top of `Pages scraped`
+ - `Page opened with a browser` at $0.002 - this is on top of `Pages scraped`
- _Example_:
- - User A:
- - Started the Actor with 10GB of memory = $0.05
- - Scraped 1,000 pages = $2.00
- - 500 of those were scraped using residential proxy = $1.00
- - 300 of those were scraped using browser = $0.60
- - This comes up to $3.65 of total revenue
- - User B:
- - Started the Actor with 5GB of memory = $0.025
- - Scraped 500 pages = $1.00
- - 200 of those were scraped using residential proxy = $0.40
- - 100 of those were scraped using browser = $0.20
- - This comes up to $1.625 of total revenue
- - That means if platform usage costs are $0.365 for user A and $0.162 for user B your profit is $4.748
+ - User A:
+ - Started the Actor with 10GB of memory = $0.05
+ - Scraped 1,000 pages = $2.00
+ - 500 of those were scraped using residential proxy = $1.00
+ - 300 of those were scraped using browser = $0.60
+ - This comes up to $3.65 of total revenue
+ - User B:
+ - Started the Actor with 5GB of memory = $0.025
+ - Scraped 500 pages = $1.00
+ - 200 of those were scraped using residential proxy = $0.40
+ - 100 of those were scraped using browser = $0.20
+ - This comes up to $1.625 of total revenue
+  - That means if platform usage costs are $0.365 for user A and $0.162 for user B, your profit is `0.8 * (3.65 + 1.625) - (0.365 + 0.162) = $3.693` (see the quick sketch below).
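+
+A quick sketch of that pay-per-event math (the figures are the hypothetical ones from the example above, not real pricing):
+
+```js
+// profit = (0.8 * revenue) - platform usage costs
+const revenue = 3.65 + 1.625; // users A and B
+const usageCosts = 0.365 + 0.162;
+const profit = 0.8 * revenue - usageCosts;
+console.log(profit); // ≈ 3.693
+```
+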
:::info Pay-per-event details
@@ -62,11 +62,11 @@ If you want more details about PPE pricing, refer to our [PPE documentation](/pl
- _How it works_: you charge users based on the number of results your Actor generates. You earn 80% of the revenue minus platform usage costs.
- _Profit calculation_: `profit = (0.8 * revenue) - platform usage costs`
- _Cost breakdown_:
- - Compute unit: $0.3 per CU
- - Residential proxies: $13 per GB
- - SERPs proxy: $3 per 1,000 SERPs
- - Data transfer (external): $0.20 per GB
- - Dataset storage: $1 per 1,000 GB-hours
+ - Compute unit: $0.3 per CU
+ - Residential proxies: $13 per GB
+ - SERPs proxy: $3 per 1,000 SERPs
+ - Data transfer (external): $0.20 per GB
+ - Dataset storage: $1 per 1,000 GB-hours
- _Example_: you set a price of $1 per 1,000 results. Two users generate 50,000 and 20,000 results, paying $50 and $20, respectively. If the platform usage costs are $5 and $2, your profit is $49.
:::info Pay-per-result details
@@ -81,9 +81,9 @@ If you want more details about PPR pricing, refer to our [PPR documentation](/pl
- _How it works_: you offer a free trial period and set a monthly fee. Users on Apify paid plans can continue using the Actor after the trial. You earn 80% of the monthly rental fees.
- _Example_: you set a 7-day free trial and $30/month rental. If 3 users start using your Actor:
- - 1st user on a paid plan pays $30 after the trial (you earn $24).
- - 2nd user starts their trial but pays next month.
- - 3rd user on a free plan finishes the trial without upgrading to a paid plan and can’t use the Actor further.
+ - 1st user on a paid plan pays $30 after the trial (you earn $24).
+ - 2nd user starts their trial but pays next month.
+ - 3rd user on a free plan finishes the trial without upgrading to a paid plan and can’t use the Actor further.
:::info Rental pricing details
@@ -160,7 +160,7 @@ Example of useful pricing estimates from the **Analytics** tab:
:::tip Use emails!
-📫 Don't forget to set an email sequence to warn and remind your users about pricing changes. Learn more about emailing your users here: [Emails to Actor users]
+📫 Don't forget to set an email sequence to warn and remind your users about pricing changes. Learn more about emailing your users here: [Emails to Actor users]
:::
@@ -172,4 +172,3 @@ Example of useful pricing estimates from the **Analytics** tab:
- Watch our webinar on how to [build, publish and monetize Actors](https://www.youtube.com/watch?v=4nxStxC1BJM)
- Read a blog post from our CEO on the [reasoning behind monetizing Actors](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/)
- Learn about the [Creator plan](https://apify.com/pricing/creator-plan), which allows you to create and freely test your own Actors for $1
-
diff --git a/sources/academy/build-and-publish/apify-store-basics/how_to_create_actor_readme.md b/sources/academy/build-and-publish/apify-store-basics/how_to_create_actor_readme.md
index 6b0d9d2e12..3af560cbd9 100644
--- a/sources/academy/build-and-publish/apify-store-basics/how_to_create_actor_readme.md
+++ b/sources/academy/build-and-publish/apify-store-basics/how_to_create_actor_readme.md
@@ -173,7 +173,6 @@ If you want to add snippets of code anywhere in your README, you can use [Carbo
If you need quick Markdown guidance, check out [https://www.markdownguide.org/cheat-sheet/](https://www.markdownguide.org/cheat-sheet/)
-
## README and SEO
Your README is your landing page.
@@ -231,8 +230,8 @@ Learn about [How to create a great input schema](/academy/actor-marketing-playbo
- Business use cases
- Link to a success story, a business use case, or a blog post.
3. How to scrape (target site)
- - Link to "How to…" blogs, if one exists (or suggest one if it doesn't)
- - Add a video tutorial or gif from an ideal Actor run.
+ - Link to "How to…" blogs, if one exists (or suggest one if it doesn't)
+ - Add a video tutorial or gif from an ideal Actor run.
:::tip Embedding YouTube videos
@@ -246,13 +245,13 @@ For better user experience, Apify Console automatically renders every YouTube UR
- This can be used as a boilerplate text for the legal section, but you should use your own judgment and also customize it with the site name.
> Our scrapers are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping
- >
+
2. Input
- Each Actor detail page has an input tab, so you just need to refer to that. If you like, you can add a screenshot showing the user what the input fields will look like.
- This is an example of how to refer to the input tab:
> Twitter Scraper has the following input options. Click on the input tab for more information.
- >
+
3. Output
- Mention "You can download the dataset extracted by (Actor name) in various formats such as JSON, HTML, CSV, or Excel.”
- Add a simplified JSON dataset example, like here https://apify.com/compass/crawler-google-places#output-example
diff --git a/sources/academy/build-and-publish/apify-store-basics/importance_of_actor_url.md b/sources/academy/build-and-publish/apify-store-basics/importance_of_actor_url.md
index 8a7fdf692d..37a1ada246 100644
--- a/sources/academy/build-and-publish/apify-store-basics/importance_of_actor_url.md
+++ b/sources/academy/build-and-publish/apify-store-basics/importance_of_actor_url.md
@@ -84,9 +84,10 @@ In Console. Open the **Actor's page**, then click on **…** in the top right co

-
## FAQ
+
+
#### Can Actor URL be different from Actor name?
Yes. While they can be the same, they don’t have to be. For the best user experience, keeping them identical is recommended, but you can experiment with the Actor's name. Just avoid changing the Actor URL.
@@ -106,4 +107,3 @@ Yes, you can. But it will most likely lower your chances of being noticed by Goo
#### Does changing my Apify account name affect the Actor URL?
Yes. If you're changing from _justanotherdev/pentagon-scraper_ to _dev/pentagon-scraper_, it counts as a new page. Essentially, the consequences are the same as after changing the technical name of the Actor.
-
diff --git a/sources/academy/build-and-publish/apify-store-basics/name_your_actor.md b/sources/academy/build-and-publish/apify-store-basics/name_your_actor.md
index 05013561a4..8707634d04 100644
--- a/sources/academy/build-and-publish/apify-store-basics/name_your_actor.md
+++ b/sources/academy/build-and-publish/apify-store-basics/name_your_actor.md
@@ -19,10 +19,10 @@ Ideally, you should choose a name that clearly shows what your Actor does and in
Your Actor's name consists of four parts: actual name, SEO name, URL, and GitHub repository name.
- Actor name (name shown in Apify Store), e.g. _Booking Scraper_.
- - Actor SEO name (name shown on Google Search, optional), e.g. _Booking.com Hotel Data Scraper_.
- - If the SEO name is not set, the Actor name will be the default name shown on Google.
+ - Actor SEO name (name shown on Google Search, optional), e.g. _Booking.com Hotel Data Scraper_.
+ - If the SEO name is not set, the Actor name will be the default name shown on Google.
- Actor URL (technical name), e.g. _booking-scraper_.
- - More on it on [Importance of Actor URL](/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url) page.
+ - More on it on [Importance of Actor URL](/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url) page.
- GitHub repository name (best to keep it similar to the other ones, for convenience), e.g. _actor-booking-scraper_.
## Actor name
diff --git a/sources/academy/build-and-publish/how-to-build/how_to_create_a_great_input_schema.md b/sources/academy/build-and-publish/how-to-build/how_to_create_a_great_input_schema.md
index 892c30b49a..10e482d542 100644
--- a/sources/academy/build-and-publish/how-to-build/how_to_create_a_great_input_schema.md
+++ b/sources/academy/build-and-publish/how-to-build/how_to_create_a_great_input_schema.md
@@ -66,57 +66,56 @@ You can see the full list of elements and their technical characteristics in [Do
Unfortunately, when it comes to UX, there's only so much you can achieve armed with HTML alone. Here are the best elements to focus on, along with some best practices for using them effectively:
- **`description` at the top**
- - As the first thing users see, the description needs to provide crucial information and a sense of reassurance if things go wrong. Key points to mention: the easiest way to try the Actor, links to a guide, and any disclaimers or other similar Actors to try.
+ - As the first thing users see, the description needs to provide crucial information and a sense of reassurance if things go wrong. Key points to mention: the easiest way to try the Actor, links to a guide, and any disclaimers or other similar Actors to try.
- 
+ 
- - Descriptions can include multiple paragraphs. If you're adding a link, it’s best to use the `target_blank` property so your user doesn’t lose the original Actor page when clicking.
+  - Descriptions can include multiple paragraphs. If you're adding a link, it’s best to use the `target="_blank"` attribute so your user doesn’t lose the original Actor page when clicking.
- **`title` of the field (regular bold text)**
- - This is the default way to name a field.
- - Keep it brief. The user’s flow should be 1. title → 2. tooltip → 3. link in the tooltip. Ideally, the title alone should provide enough clarity. However, avoid overloading the title with too much information. Instead, make the title as concise as possible, expand details in the tooltip, and include a link in the tooltip for full instructions.
+ - This is the default way to name a field.
+ - Keep it brief. The user’s flow should be 1. title → 2. tooltip → 3. link in the tooltip. Ideally, the title alone should provide enough clarity. However, avoid overloading the title with too much information. Instead, make the title as concise as possible, expand details in the tooltip, and include a link in the tooltip for full instructions.
- 
+ 
- **`prefill`, the default input**
- - this is your chance to show rather than tell
- - Keep the **prefilled number** low. Set it to 0 if it's irrelevant for a default run.
- - Make the **prefilled text** example simple and easy to remember.
- - If your Actor accepts various URL formats, add a few different **prefilled URLs** to show that possibility.
- - Use the **prefilled date** format that the user is expected to follow. This way, they can learn the correct format without needing to check the tooltip.
- - There’s also a type of field that looks like a prefill but isn’t — usually a `default` field. It’s not counted as actual input but serves as a mock input to show users what to type or paste. It is gray and disappears after clicking on it. Use this to your advantage.
+  - This is your chance to show rather than tell.
+ - Keep the **prefilled number** low. Set it to 0 if it's irrelevant for a default run.
+ - Make the **prefilled text** example simple and easy to remember.
+ - If your Actor accepts various URL formats, add a few different **prefilled URLs** to show that possibility.
+ - Use the **prefilled date** format that the user is expected to follow. This way, they can learn the correct format without needing to check the tooltip.
+ - There’s also a type of field that looks like a prefill but isn’t — usually a `default` field. It’s not counted as actual input but serves as a mock input to show users what to type or paste. It is gray and disappears after clicking on it. Use this to your advantage.
- **toggle**
- - The toggle is a boolean field. A boolean field represents a yes/no choice.
- - How would you word this toggle: **Skip closed places** or **Scrape open places only**? And should the toggle be enabled or disabled by default?
+ - The toggle is a boolean field. A boolean field represents a yes/no choice.
+ - How would you word this toggle: **Skip closed places** or **Scrape open places only**? And should the toggle be enabled or disabled by default?
- 
-
- - You have to consider this when you're choosing how to word the toggle button and which choice to set up as the default. If you're making this more complex than it's needed (e.g. by using negation as the ‘yes’ choice), you're increasing your user's cognitive load. You also might get them to receive way less, or way more, data than they need from a default run.
- - In our example, we assume the default user wants to scrape all places but still have the option to filter out closed ones. However, they have to make that choice consciously, so we keep the toggle disabled by default. If the toggle were enabled by default, users might not notice it, leading them to think the tool isn't working properly when it returns fewer results than expected.
+ 
+  - You have to consider this when you're choosing how to word the toggle button and which choice to set up as the default. If you make this more complex than needed (e.g. by using negation as the ‘yes’ choice), you're increasing your user's cognitive load. They also might end up receiving far less, or far more, data than they need from a default run.
+ - In our example, we assume the default user wants to scrape all places but still have the option to filter out closed ones. However, they have to make that choice consciously, so we keep the toggle disabled by default. If the toggle were enabled by default, users might not notice it, leading them to think the tool isn't working properly when it returns fewer results than expected.
- **sections or `sectionCaption` (BIG bold text) and `sectionDescription`**
- - A section looks like a wrapped toggle list.
+ - A section looks like a wrapped toggle list.
- 
+ 
- - It is useful to section off non-default ways of input or extra features. If your tool is complex, don't leave all fields in the first section. Just group them by topic and section them off (see the screenshot above ⬆️)
- - You can add a description to every section. Use `sectionDescription` only if you need to provide extra information about the section (see the screenshot below ⬇️.
- - sometimes `sectionDescription` is used as a space for disclaimers so the user is informed of the risks from the outset instead of having to click on the tooltip.
+  - It is useful to section off non-default input options or extra features. If your tool is complex, don't leave all fields in the first section. Just group them by topic and section them off (see the screenshot above ⬆️).
+  - You can add a description to every section. Use `sectionDescription` only if you need to provide extra information about the section (see the screenshot below ⬇️).
+  - Sometimes `sectionDescription` is used as a space for disclaimers so the user is informed of the risks from the outset instead of having to click on the tooltip.
- 
+ 
- tooltips or `description` to the title
- - To see the tooltip's text, the user needs to click on the `?` icon.
- - This is your space to explain the title and what's going to happen in that field: any terminology, referrals to other fields of the tool, examples that don't fit the prefill, or caveats can be detailed here. Using HTML, you can add links, line breaks, code, and other regular formatting here. Use this space to add links to relevant guides, video tutorials, screenshots, issues, or readme parts if needed.
- - Wording in titles vs. tooltips. Titles are usually nouns. They have a neutral tone and simply inform on what content this field is accepting (**Usernames**).
- - Tooltips to those titles are usually verbs in the imperative that tell the user what to do (_Add, enter, use_).
- - This division is not set in stone, but the reason why the tooltip is an imperative verb is because, if the user is clicking on the tooltip, we assume they are looking for clarifications or instructions on what to do.
+ - To see the tooltip's text, the user needs to click on the `?` icon.
+ - This is your space to explain the title and what's going to happen in that field: any terminology, referrals to other fields of the tool, examples that don't fit the prefill, or caveats can be detailed here. Using HTML, you can add links, line breaks, code, and other regular formatting here. Use this space to add links to relevant guides, video tutorials, screenshots, issues, or readme parts if needed.
+ - Wording in titles vs. tooltips. Titles are usually nouns. They have a neutral tone and simply inform on what content this field is accepting (**Usernames**).
+ - Tooltips to those titles are usually verbs in the imperative that tell the user what to do (_Add, enter, use_).
+  - This division is not set in stone, but the tooltip uses an imperative verb because, if the user is clicking on it, we assume they are looking for clarification or instructions on what to do.
- 
+ 
- emojis (visual component)
- - Use them to attract attention or as visual shortcuts. Use emojis consistently to invoke a user's iconic memory. The visual language should match across the whole input schema (and README) so the user can understand what section or field is referred to without reading the whole title.
- - Don't overload the schema with emojis. They attract attention, so you need to use them sparingly.
+ - Use them to attract attention or as visual shortcuts. Use emojis consistently to invoke a user's iconic memory. The visual language should match across the whole input schema (and README) so the user can understand what section or field is referred to without reading the whole title.
+ - Don't overload the schema with emojis. They attract attention, so you need to use them sparingly.
:::tip
@@ -165,7 +164,7 @@ The version above was the improved input schema. Here's what this tool's input s
- _User feedback_. If they're asking obvious things, complaining, or consistently making silly mistakes with input, take notes. Feedback from users can help you understand their experience and identify areas for improvement.
- _High churn rates_. If your users are trying your tool but quickly abandon it, this is a sign they are having difficulties with your schema.
-- _Input Schema Viewer_. Write your base schema in any code editor, then copy the file and put it into [**Input Schema Viewer](https://console.apify.com/actors/UHTe5Bcb4OUEkeahZ/source).** This tool should help you visualize your Input Schema before you add it to your Actor and build it. Seeing how your edits look in Apify Console right away will make the process of editing the fields in code easier.
+- _Input Schema Viewer_. Write your base schema in any code editor, then copy the file and put it into [**Input Schema Viewer**](https://console.apify.com/actors/UHTe5Bcb4OUEkeahZ/source). This tool should help you visualize your Input Schema before you add it to your Actor and build it. Seeing how your edits look in Apify Console right away will make the process of editing the fields in code easier.
## Resources
diff --git a/sources/academy/build-and-publish/how-to-build/index.md b/sources/academy/build-and-publish/how-to-build/index.md
index 5670747a3b..222d376fb2 100644
--- a/sources/academy/build-and-publish/how-to-build/index.md
+++ b/sources/academy/build-and-publish/how-to-build/index.md
@@ -21,14 +21,14 @@ At Apify, we try to make building web scraping and automation straightforward. Y
Since scraping and automation come in various forms, we decided to build not just one, but _six_ scrapers. This way, you can always pick the right tool for the job. Let's take a look at each particular tool and its advantages and disadvantages.
-| Scraper | Technology | Advantages | Disadvantages | Best for |
-| --- | --- | --- | --- | --- |
-| 🌐 Web Scraper | Headless Chrome Browser | Simple, fully JavaScript-rendered pages | Executes only client-side JavaScript | Websites with heavy client-side JavaScript |
-| 👐 Puppeteer Scraper | Headless Chrome Browser | Powerful Puppeteer functions, executes both server-side and client-side JavaScript | More complex | Advanced scraping with client/server-side JS |
-| 🎭 Playwright Scraper | Cross-browser support with Playwright library | Cross-browser support, executes both server-side and client-side JavaScript | More complex | Cross-browser scraping with advanced features |
-| 🍩 Cheerio Scraper | HTTP requests + Cheerio parser (JQuery-like for servers) | Simple, fast, cost-effective | Pages may not be fully rendered (lacks JavaScript rendering), executes only server-side JavaScript | High-speed, cost-effective scraping |
-| ⚠️ JSDOM Scraper | JSDOM library (Browser-like DOM API) | + Handles client-side JavaScript + Faster than full-browser solutions + Ideal for light scripting | Not for heavy dynamic JavaScript, executes server-side code only, depends on pre-installed NPM modules | Speedy scraping with light client-side JS |
-| 🍲 BeautifulSoup Scraper | Python-based, HTTP requests + BeautifulSoup parser | Python-based, supports recursive crawling and URL lists | No full-featured web browser, not suitable for dynamic JavaScript-rendered pages | Python users needing simple, recursive crawling |
+| Scraper | Technology | Advantages | Disadvantages | Best for |
+| ------------------------ | -------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------- |
+| 🌐 Web Scraper | Headless Chrome Browser | Simple, fully JavaScript-rendered pages | Executes only client-side JavaScript | Websites with heavy client-side JavaScript |
+| 👐 Puppeteer Scraper | Headless Chrome Browser | Powerful Puppeteer functions, executes both server-side and client-side JavaScript | More complex | Advanced scraping with client/server-side JS |
+| 🎭 Playwright Scraper | Cross-browser support with Playwright library | Cross-browser support, executes both server-side and client-side JavaScript | More complex | Cross-browser scraping with advanced features |
+| 🍩 Cheerio Scraper | HTTP requests + Cheerio parser (JQuery-like for servers) | Simple, fast, cost-effective | Pages may not be fully rendered (lacks JavaScript rendering), executes only server-side JavaScript | High-speed, cost-effective scraping |
+| ⚠️ JSDOM Scraper | JSDOM library (Browser-like DOM API) | + Handles client-side JavaScript + Faster than full-browser solutions + Ideal for light scripting | Not for heavy dynamic JavaScript, executes server-side code only, depends on pre-installed NPM modules | Speedy scraping with light client-side JS |
+| 🍲 BeautifulSoup Scraper | Python-based, HTTP requests + BeautifulSoup parser | Python-based, supports recursive crawling and URL lists | No full-featured web browser, not suitable for dynamic JavaScript-rendered pages | Python users needing simple, recursive crawling |
### How do I choose the right universal web scraper to start with?
@@ -41,7 +41,6 @@ Since scraping and automation come in various forms, we decided to build not jus
- Use ⚠️ [JSDOM Scraper](https://apify.com/apify/jsdom-scraper) for lightweight, speedy scraping with minimal client-side JavaScript requirements.
- Use 🍲 [BeautifulSoup Scraper](https://apify.com/apify/beautifulsoup-scraper) for Python-based scraping, especially with recursive crawling and processing URL lists.
-
To make it easier, here's a short questionnaire that guides you on selecting the best scraper based on your specific use case:
@@ -76,7 +75,6 @@ This should help you navigate through the options and choose the right scraper b
-
📚 Resources:
- How to use [Web Scraper](https://www.youtube.com/watch?v=5kcaHAuGxmY) to scrape any website
@@ -87,11 +85,10 @@ This should help you navigate through the options and choose the right scraper b
Similar to our universal scrapers, our [code templates](https://apify.com/templates) also provide a quick start for developing web scrapers, automation scripts, and testing tools. Built on popular libraries like BeautifulSoup for Python or Playwright for JavaScript, they save time on setup, allowing you to focus on customization. Though they require more coding than universal scrapers, they're ideal for those who want a flexible foundation while still needing room to tailor their solutions.
-| Code template | Supported libraries | Purpose | Pros | Cons |
-| --- | --- | --- | --- | --- |
-| 🐍 Python | Requests, BeautifulSoup, Scrapy, Selenium, Playwright | Creating scrapers Automation Testing tools | - Simplifies setup - Supports major Python libraries | - Requires more manual coding (than universal scrapers)- May be restrictive for complex tasks |
-| ☕️ JavaScript | Playwright, Selenium, Cheerio, Cypress, LangChain | Creating scrapers Automation Testing tools | - Eases development with pre-set configurations - Flexibility with JavaScript and TypeScript | - Requires more manual coding (than universal scrapers)- May be restrictive for tasks needing full control |
-
+| Code template | Supported libraries | Purpose | Pros | Cons |
+| ------------- | ----------------------------------------------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| 🐍 Python | Requests, BeautifulSoup, Scrapy, Selenium, Playwright | Creating scrapers Automation Testing tools | - Simplifies setup - Supports major Python libraries | - Requires more manual coding (than universal scrapers)- May be restrictive for complex tasks |
+| ☕️ JavaScript | Playwright, Selenium, Cheerio, Cypress, LangChain | Creating scrapers Automation Testing tools | - Eases development with pre-set configurations - Flexibility with JavaScript and TypeScript | - Requires more manual coding (than universal scrapers)- May be restrictive for tasks needing full control |
📚 Resources:
@@ -128,7 +125,6 @@ While these tools are distinct, they can be combined. For example, you can use C
- Webinar on how to use [Crawlee Python](https://www.youtube.com/watch?v=ip8Ii0eLfRY)
- Introduction to Apify's [Python SDK](https://www.youtube.com/watch?v=C8DmvJQS3jk)
-
## Code templates vs. universal scrapers vs. libraries
Basically, the choice here depends on how much flexibility you need and how much coding you're willing to do. More flexibility → more coding.
diff --git a/sources/academy/build-and-publish/how-to-build/running_a_web_server.md b/sources/academy/build-and-publish/how-to-build/running_a_web_server.md
index 8a4eaecc86..985f772eb9 100644
--- a/sources/academy/build-and-publish/how-to-build/running_a_web_server.md
+++ b/sources/academy/build-and-publish/how-to-build/running_a_web_server.md
@@ -62,11 +62,7 @@ Now we need to read the following environment variables:
- **APIFY_DEFAULT_KEY_VALUE_STORE_ID** is the ID of the default key-value store of this Actor where we can store screenshots.
```js
-const {
- APIFY_CONTAINER_PORT,
- APIFY_CONTAINER_URL,
- APIFY_DEFAULT_KEY_VALUE_STORE_ID,
-} = process.env;
+const { APIFY_CONTAINER_PORT, APIFY_CONTAINER_URL, APIFY_DEFAULT_KEY_VALUE_STORE_ID } = process.env;
```
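+
+Each screenshot stored this way can later be linked through the Apify API's key-value store record endpoint. Here's a small sketch of composing such a link (an illustration, not part of the original code; it assumes the default key-value store's records are readable by their ID):
+
+```js
+// Public URL of the n-th screenshot in the Actor's default key-value store
+const screenshotUrl = (n) =>
+    `https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${n}.jpg`;
+```
+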
Next, we'll create an array of the processed URLs where the **n**th URL has its screenshot stored under the key **n**.jpg in the key-value store.
@@ -133,7 +129,9 @@ app.post('/add-url', async (req, res) => {
await browser.close();
// ... save screenshot to key-value store and add URL to processedUrls.
- await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
+ await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, {
+ contentType: 'image/jpeg',
+ });
processedUrls.push(url);
res.redirect('/');
@@ -162,11 +160,7 @@ const app = express();
app.use(express.json());
app.use(express.urlencoded({ extended: true }));
-const {
- APIFY_CONTAINER_PORT,
- APIFY_CONTAINER_URL,
- APIFY_DEFAULT_KEY_VALUE_STORE_ID,
-} = process.env;
+const { APIFY_CONTAINER_PORT, APIFY_CONTAINER_URL, APIFY_DEFAULT_KEY_VALUE_STORE_ID } = process.env;
const processedUrls = [];
@@ -219,7 +213,9 @@ app.post('/add-url', async (req, res) => {
await browser.close();
// ... save screenshot to key-value store and add URL to processedUrls.
- await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
+ await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, {
+ contentType: 'image/jpeg',
+ });
processedUrls.push(url);
res.redirect('/');
diff --git a/sources/academy/build-and-publish/interacting-with-users/emails_to_actor_users.md b/sources/academy/build-and-publish/interacting-with-users/emails_to_actor_users.md
index 663628547a..ba6c3ad03b 100644
--- a/sources/academy/build-and-publish/interacting-with-users/emails_to_actor_users.md
+++ b/sources/academy/build-and-publish/interacting-with-users/emails_to_actor_users.md
@@ -38,15 +38,15 @@ Our general policy is to avoid spamming users with unnecessary emails. We contac
New filter, faster scraping, changes in input schema, in output schema, a new Integration, etc.
->✉️ 🏙️ Introducing Deep city search for Tripadvisor scrapers
+> ✉️ 🏙️ Introducing Deep city search for Tripadvisor scrapers
>
->Hi,
+> Hi,
>
->Tired of Tripadvisor's 3000 hotels-per-search limit? We've got your back. Say hello to our latest baked-in feature: Deep city search. Now, to get all results from a country-wide search you need to just set Max search results above 3000, and watch the magic happen.
+> Tired of Tripadvisor's 3000 hotels-per-search limit? We've got your back. Say hello to our latest baked-in feature: Deep city search. Now, to get all results from a country-wide search, you just need to set Max search results above 3000 and watch the magic happen.
>
->A bit of context: while Tripadvisor never limited the search for restaurants or attractions, hotel search was a different case; it always capped at 3000. Our smart search is designed to overcome that limit by including every city within your chosen location. We scrape hotels from each one, ensuring no hidden gems slip through the cracks. This feature is available for [Tripadvisor Scraper](https://console.apify.com/actors/dbEyMBriog95Fv8CW/console) and [Tripadvisor Hotels Scraper](https://console.apify.com/actors/qx7G70MC4WBE273SM/console).
+> A bit of context: while Tripadvisor never limited the search for restaurants or attractions, hotel search was a different case; it always capped at 3000. Our smart search is designed to overcome that limit by including every city within your chosen location. We scrape hotels from each one, ensuring no hidden gems slip through the cracks. This feature is available for [Tripadvisor Scraper](https://console.apify.com/actors/dbEyMBriog95Fv8CW/console) and [Tripadvisor Hotels Scraper](https://console.apify.com/actors/qx7G70MC4WBE273SM/console).
>
->Get ready for an unbeatable hotel-hunting experience. Give it a spin, and let us know what you think!
+> Get ready for an unbeatable hotel-hunting experience. Give it a spin, and let us know what you think!
Introduce and explain the features, add a screenshot of a feature if it will show in the input schema, and ask for feedback.
@@ -54,15 +54,15 @@ Introduce and explain the features, add a screenshot of a feature if it will sho
A common situation in web scraping that's out of your control.
->✉️ 📣 Output changes for Facebook Ads Scraper
+> ✉️ 📣 Output changes for Facebook Ads Scraper
>
->Hi,
+> Hi,
>
->We've got some news regarding your favorite Actor – [Facebook Ads Scraper](https://console.apify.com/actors/JJghSZmShuco4j9gJ/console). Recently, Facebook Ads have changed their data format. To keep our Actor running smoothly, we'll be adapting to these changes by slightly tweaking the Actor Output. Don't worry; it's a breeze! Some of the output data might just appear under new titles.
+> We've got some news regarding your favorite Actor – [Facebook Ads Scraper](https://console.apify.com/actors/JJghSZmShuco4j9gJ/console). Recently, Facebook Ads have changed their data format. To keep our Actor running smoothly, we'll be adapting to these changes by slightly tweaking the Actor Output. Don't worry; it's a breeze! Some of the output data might just appear under new titles.
>
->This change will take place on October 10; please** **make sure to remap your integrations accordingly.
+> This change will take place on October 10; please make sure to remap your integrations accordingly.
>
->Need a hand or have questions? Our support team is just one friendly message away.
+> Need a hand or have questions? Our support team is just one friendly message away.
Inform users about the reason for changes and how the changes impact them and the Actor + give them a date when the change takes effect.
@@ -70,32 +70,32 @@ Inform users about the reason for changes and how the changes impact them and th
Email 1 (before the change, warning about deprecation).
->✉️ 🛎 Changes to Booking Scraper
+> ✉️ 🛎 Changes to Booking Scraper
>
->Hi,
+> Hi,
>
->We’ve got news regarding the Booking scraper you have been using. This change will happen in two steps:
+> We’ve got news regarding the Booking scraper you have been using. This change will happen in two steps:
>
->1. On September 22, we will deprecate it, i.e., new users will not be able to find it in Store. You will still be able to use it though.
->2. At the end of October, we will unpublish this Actor, and from that point on, you will not be able to use it anymore.
+> 1. On September 22, we will deprecate it, i.e., new users will not be able to find it in Store. You will still be able to use it though.
+> 2. At the end of October, we will unpublish this Actor, and from that point on, you will not be able to use it anymore.
>
->Please use this time to change your integrations to our new [Booking Scraper](https://apify.com/voyager/booking-scraper).
+> Please use this time to change your integrations to our new [Booking Scraper](https://apify.com/voyager/booking-scraper).
>
->That’s it! If you have any questions or need more information, don’t hesitate to reach out.
+> That’s it! If you have any questions or need more information, don’t hesitate to reach out.
Warn the users about the deprecation and future unpublishing + add extra information about related Actors if applicable + give them steps and the date when the change takes effect.
Email 2 (after the change, warning about unpublishing)
->✉️ **📢 Deprecated Booking Scraper will stop working as announced 📢**
+> ✉️ **📢 Deprecated Booking Scraper will stop working as announced 📢**
>
->Hi,
+> Hi,
>
->Just a heads-up: today, the deprecated [Booking Scraper](https://console.apify.com/actors/5T5NTHWpvetjeRo3i/console) you have been using will be completely unpublished as announced, and you will not be able to use it anymore.
+> Just a heads-up: today, the deprecated [Booking Scraper](https://console.apify.com/actors/5T5NTHWpvetjeRo3i/console) you have been using will be completely unpublished as announced, and you will not be able to use it anymore.
>
->If you want to continue to scrape Booking.com, make sure to switch to the [latest Actor version](https://apify.com/voyager/booking-scraper).
+> If you want to continue to scrape Booking.com, make sure to switch to the [latest Actor version](https://apify.com/voyager/booking-scraper).
>
->For any assistance or questions, don't hesitate to reach out to our support team.
+> For any assistance or questions, don't hesitate to reach out to our support team.
Remind users to switch to the Actor with a new model.
@@ -103,15 +103,15 @@ Remind users to switch to the Actor with a new model.
Actor downtime, performance issues, Actor directly influenced by platform hiccups.
->✉️ **🛠️ Update on Google Maps Scraper: fixed and ready to go**
+> ✉️ **🛠️ Update on Google Maps Scraper: fixed and ready to go**
>
->Hi,
+> Hi,
>
->We've got a quick update on the Google Maps Scraper for you. If you've been running the Actor this week, you might have noticed some hiccups — scraping was failing for certain places, causing retries and overall slowness.
+> We've got a quick update on the Google Maps Scraper for you. If you've been running the Actor this week, you might have noticed some hiccups — scraping was failing for certain places, causing retries and overall slowness.
>
->We apologize for any inconvenience this may have caused you. The **good news is those performance issues are now resolved**. Feel free to resurrect any affected runs using the "latest" build, should work like a charm now.
+> We apologize for any inconvenience this may have caused you. The **good news is that those performance issues are now resolved**. Feel free to resurrect any affected runs using the "latest" build; it should work like a charm now.
>
->Need a hand or have questions? Feel free to reply to this email.
+> Need a hand or have questions? Feel free to reply to this email.
Apologize to users and or let them know you're working on it/everything is fixed now. This approach helps maintain trust and reassures users that you're addressing the situation.
diff --git a/sources/academy/build-and-publish/interacting-with-users/issues_tab.md b/sources/academy/build-and-publish/interacting-with-users/issues_tab.md
index 330809182d..0fda16f823 100644
--- a/sources/academy/build-and-publish/interacting-with-users/issues_tab.md
+++ b/sources/academy/build-and-publish/interacting-with-users/issues_tab.md
@@ -93,5 +93,4 @@ When we made the tab public, we took inspiration from StackOverflow’s SEO stra
Politeness goes a long way! Make sure your responses are respectful and straight to the point. It helps to keep things professional, even if the issue seems minor.
-
https://rewind.com/blog/best-practices-for-using-github-issues/
diff --git a/sources/academy/build-and-publish/interacting-with-users/your_store_bio.md b/sources/academy/build-and-publish/interacting-with-users/your_store_bio.md
index 2b51350c67..3063e34fab 100644
--- a/sources/academy/build-and-publish/interacting-with-users/your_store_bio.md
+++ b/sources/academy/build-and-publish/interacting-with-users/your_store_bio.md
@@ -15,7 +15,7 @@ This space is all about helping you shine and promote your tools and skills. Her
- Share your contact email, website, GitHub, X (Twitter), LinkedIn, or Discord handles.
- Summarize what you’ve been doing in Apify Store, your main skills, big achievements, and any relevant experience.
- Offer more ways for people to connect with you, such as links for booking a meeting, discounts, a subscription option for your email newsletter, or your YouTube channel or blog.
- - You can even add a Linktree to keep things neat.
+ - You can even add a Linktree to keep things neat.
- Highlight your other tools on different platforms.
- Get creative by adding banners and GIFs to give your profile some personality.
diff --git a/sources/academy/build-and-publish/promoting-your-actor/parasite_seo.md b/sources/academy/build-and-publish/promoting-your-actor/parasite_seo.md
index 062e981596..cd703a093c 100644
--- a/sources/academy/build-and-publish/promoting-your-actor/parasite_seo.md
+++ b/sources/academy/build-and-publish/promoting-your-actor/parasite_seo.md
@@ -13,7 +13,6 @@ slug: /actor-marketing-playbook/promote-your-actor/parasite-seo
Here’s a full definition, from Authority Hackers:
> Parasite SEO involves publishing a quality piece of content on an established, high-authority external site to rank on search engines. This gives you the benefit of the host’s high traffic, boosting your chances for leads and successful conversions. These high DR websites have a lot of authority and trust in the eyes of Google
->
As you can see, you’re leveraging the existing authority of a third-party site where you can publish content promoting your Actors, and the content should rank better and faster as you publish it on an established site.
diff --git a/sources/academy/build-and-publish/why_publish.md b/sources/academy/build-and-publish/why_publish.md
index 7774cee15f..20732d052a 100644
--- a/sources/academy/build-and-publish/why_publish.md
+++ b/sources/academy/build-and-publish/why_publish.md
@@ -28,7 +28,7 @@ Apify Store offers flexible pricing models that let you match your Actor's value
- Pay-per-event (PPE): Charge for any custom events your Actor triggers (maximum flexibility, AI/MCP compatible, priority store placement)
- Pay-per-result (PPR): Set pricing based on dataset items generated
-(predictable costs for users, unlimited revenue potential)
+ (predictable costs for users, unlimited revenue potential)
- Rental: Charge a flat monthly fee for continuous access (users cover their own platform usage costs)
All models give you 80% of revenue, with platform usage costs deducted for PPR and PPE models.
@@ -70,7 +70,7 @@ Ready to publish? The process involves four main stages:
1. Development: Build your Actor using [Apify SDKs](https://docs.apify.com/sdk), [Crawlee](https://crawlee.dev/), or [Actor templates](https://apify.com/templates)
1. Publication: Set up display information, description, README, and
-monetization
+ monetization
1. Testing: Ensure your Actor works reliably with automated or manual tests
1. Promotion: Optimize for SEO, share on social media, and create tutorials
diff --git a/sources/academy/glossary/concepts/html_elements.md b/sources/academy/glossary/concepts/html_elements.md
index d0c66e754a..7d0a02306e 100644
--- a/sources/academy/glossary/concepts/html_elements.md
+++ b/sources/academy/glossary/concepts/html_elements.md
@@ -14,7 +14,7 @@ An HTML element is a building block of an HTML document. It is used to represent
You can also add **attributes** to an element to provide additional information or to control how the element behaves. For example, the `src` attribute is used to specify the source of an image, like this:
```html
-
+
```
In JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the [`querySelector()` method](./querying_css_selectors.md) to select an element by its [CSS selector](./css_selectors.md), like this:
diff --git a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
index 6bc13433f1..1355cf9087 100644
--- a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
+++ b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md
@@ -45,11 +45,9 @@ Also, an object including these values should be persisted during the run in th
```json
{
- "errors": { // all of the errors for every request path
- "some-site.com/products/123": [
- "error1",
- "error2"
- ]
+ "errors": {
+ // all of the errors for every request path
+ "some-site.com/products/123": ["error1", "error2"]
},
"totalSaved": 43 // total number of saved items throughout the entire run
}
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
index 24399f8df6..11533ec137 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md
@@ -71,18 +71,20 @@ router.addHandler(labels.START, async ({ $, crawler, request }) => {
// and initialize its collected offers count to 0
tracker.incrementASIN(element.attr('data-asin'));
- await crawler.addRequest([{
- url,
- label: labels.PRODUCT,
- userData: {
- data: {
- title: titleElement.first().text().trim(),
- asin: element.attr('data-asin'),
- itemUrl: url,
- keyword,
+        await crawler.addRequests([
+ {
+ url,
+ label: labels.PRODUCT,
+ userData: {
+ data: {
+ title: titleElement.first().text().trim(),
+ asin: element.attr('data-asin'),
+ itemUrl: url,
+ keyword,
+ },
},
},
- }]);
+ ]);
}
});
@@ -91,16 +93,18 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
const element = $('div#productDescription');
- await crawler.addRequests([{
- url: OFFERS_URL(data.asin),
- label: labels.OFFERS,
- userData: {
- data: {
- ...data,
- description: element.text().trim(),
+ await crawler.addRequests([
+ {
+ url: OFFERS_URL(data.asin),
+ label: labels.OFFERS,
+ userData: {
+ data: {
+ ...data,
+ description: element.text().trim(),
+ },
},
},
- }]);
+ ]);
});
router.addHandler(labels.OFFERS, async ({ $, request }) => {
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
index 88755208eb..f8e3f36263 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/rotating_proxies.md
@@ -40,9 +40,7 @@ const crawler = new CheerioCrawler({
// We can add options for each
// session created by the session
// pool here
- sessionOptions: {
-
- },
+ sessionOptions: {},
},
maxConcurrency: 50,
// ...
diff --git a/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md b/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
index e2cf1cf0d3..882ef89666 100644
--- a/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
+++ b/sources/academy/platform/expert_scraping_with_apify/solutions/using_api_and_client.md
@@ -150,39 +150,39 @@ And before we push to the platform, let's not forget to write an input schema in
```json
{
- "title": "Actor Caller",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "memory": {
- "title": "Memory",
- "type": "integer",
- "description": "Select memory in megabytes.",
- "default": 4096,
- "maximum": 32768,
- "unit": "MB"
+ "title": "Actor Caller",
+ "type": "object",
+ "schemaVersion": 1,
+ "properties": {
+ "memory": {
+ "title": "Memory",
+ "type": "integer",
+ "description": "Select memory in megabytes.",
+ "default": 4096,
+ "maximum": 32768,
+ "unit": "MB"
+ },
+ "useClient": {
+ "title": "Use client?",
+ "type": "boolean",
+ "description": "Specifies whether the Apify JS client, or the pure Apify API should be used.",
+ "default": true
+ },
+ "fields": {
+ "title": "Fields",
+ "type": "array",
+ "description": "Enter the dataset fields to export to CSV",
+ "prefill": ["title", "url", "price"],
+ "editor": "stringList"
+ },
+ "maxItems": {
+ "title": "Max items",
+ "type": "integer",
+ "description": "Fill the maximum number of items to export.",
+ "default": 10
+ }
},
- "useClient": {
- "title": "Use client?",
- "type": "boolean",
- "description": "Specifies whether the Apify JS client, or the pure Apify API should be used.",
- "default": true
- },
- "fields": {
- "title": "Fields",
- "type": "array",
- "description": "Enter the dataset fields to export to CSV",
- "prefill": ["title", "url", "price"],
- "editor": "stringList"
- },
- "maxItems": {
- "title": "Max items",
- "type": "integer",
- "description": "Fill the maximum number of items to export.",
- "default": 10
- }
- },
- "required": ["useClient", "memory", "fields", "maxItems"]
+ "required": ["useClient", "memory", "fields", "maxItems"]
}
```
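+
+As a quick sanity check, here's a minimal sketch (assuming the JavaScript Apify SDK) of reading those four fields inside the Actor:
+
+```js
+import { Actor } from 'apify';
+
+await Actor.init();
+
+// Field names match the properties defined in the input schema above
+const { memory, useClient, fields, maxItems } = await Actor.getInput();
+console.log({ memory, useClient, fields, maxItems });
+
+await Actor.exit();
+```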
diff --git a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md
index 0f3726b1f7..74bd756455 100644
--- a/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md
+++ b/sources/academy/platform/get_most_of_actors/monetizing_your_actor.md
@@ -28,7 +28,8 @@ You make your Actor rental with 7 days free trial and then $30/month. During the
2. Second user, on Apify paid plan, starts the free trial on 25th
3. Third user, on Apify free plan, start the free trial on 20th
-The first user pays their first rent 7 days after the free trial, i.e., on 22nd. The second user only starts paying the rent next month. The third user is on Apify free plan, so after the free trial ends on 27th, they are not charged and cannot use the Actor further until they get a paid plan. Your profit is computed only from the first user. They were charged $30, so 80% of this goes to you, i.e., _0.8 * 30 = $24_.
+The first user pays their first rent 7 days after the free trial, i.e., on the 22nd. The second user only starts paying the rent next month. The third user is on the Apify free plan, so after the free trial ends on the 27th, they are not charged and cannot use the Actor further until they get a paid plan. Your profit is computed only from the first user. They were charged $30, so 80% of this goes to you, i.e., _0.8 \* 30 = $24_.
+
## Pay-per-result pricing model
@@ -40,7 +41,7 @@ In this model, you set a price per 1000 results. Users are charged based on the
### Pay-per-result unit pricing for cost computation
| Service | Unit price |
-|:--------------------------------|:---------------------------|
+| :------------------------------ | :------------------------- |
| Compute unit | **$0.3** / CU |
| Residential proxies | **$13** / GB |
| SERPs proxy | **$3** / 1,000 SERPs |
@@ -57,7 +58,6 @@ In this model, you set a price per 1000 results. Users are charged based on the
| Request queue - reads | **$0.004** / 1,000 reads |
| Request queue - writes | **$0.02** / 1,000 writes |
-
Only revenue & cost for Apify customers on paid plans are taken into consideration when computing your profit. Users on free plans are not reflected there, although you can see statistics about the potential revenue of users that are currently on free plans in Actor Insights in the Apify Console.
:::note What are Gigabyte-hours?
@@ -66,8 +66,8 @@ Gigabyte-hours (GB-hours) are a unit of measurement used to quantify data storag
For example, if you host 50GB of data for 30 days:
-- Convert days to hours: _30 * 24 = 720_
-- Multiply data size by hours: _50 * 720 = 36,000_
+- Convert days to hours: _30 \* 24 = 720_
+- Multiply data size by hours: _50 \* 720 = 36,000_
This means that storing 50 GB of data for 30 days results in 36,000 GB-hours.
:::
@@ -79,7 +79,8 @@ Read more about Actors in the Store and different pricing models from the perspe
You make your Actor pay-per-result and set price to be $1/1,000 results. During the first month, two users on Apify paid plans use your Actor to get 50,000 and 20,000 results, costing them $50 and $20 respectively. Let's say the underlying platform usage for the first user is $5 and for the second $2. Third user, this time on Apify free plan, uses the Actor to get 5,000 results, with underlying platform usage $0.5.
-Your profit is computed only from the first two users, since they are on Apify paid plans. The revenue for the first user is $50 and for the second $20, i.e., total revenue is $70. The total underlying cost is _$5 + $2 = $7_. Since your profit is 80% of the revenue minus the cost, it would be _0.8 * 70 - 7 = $49_.
+Your profit is computed only from the first two users, since they are on Apify paid plans. The revenue for the first user is $50 and for the second $20, i.e., total revenue is $70. The total underlying cost is _$5 + $2 = $7_. Since your profit is 80% of the revenue minus the cost, it would be _0.8 \* 70 - 7 = $49_.
+
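If it helps, here is the same profit computation as a tiny script, using the example's numbers (only the two paid-plan users enter the formula):

```js
// Revenue and platform usage costs of the two users on Apify paid plans.
const revenue = 50 + 20; // $70
const platformUsageCosts = 5 + 2; // $7
const profit = 0.8 * revenue - platformUsageCosts;

console.log(profit); // 49
```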
### Best practices for pay-per-result Actors
@@ -143,7 +144,6 @@ Create SEO-optimized descriptions and README files to improve search engine visi
- Publish articles about your Actor on relevant websites
- Consider creating a product showcase on platforms like Product hunt
-
Remember to tag Apify in your social media posts for additional exposure. Effective promotion can significantly impact your Actor's success, often making the difference between Actors with many paid users and those with few to none.
Learn more about promoting your Actor from [Apify's Marketing Playbook](https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab).
diff --git a/sources/academy/platform/getting_started/apify_client.md b/sources/academy/platform/getting_started/apify_client.md
index 4cc4b4e6d2..7ffccad988 100644
--- a/sources/academy/platform/getting_started/apify_client.md
+++ b/sources/academy/platform/getting_started/apify_client.md
@@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem';
---
-Now that you've gotten your toes wet with interacting with the Apify API through raw HTTP requests, you're ready to become familiar with the **Apify client**, which is a package available for both JavaScript and Python that allows you to interact with the API in your code without explicitly needing to make any GET or POST requests.
+Now that you've gotten your toes wet interacting with the Apify API through raw HTTP requests, you're ready to become familiar with the **Apify client**, which is a package available for both JavaScript and Python that allows you to interact with the API in your code without explicitly needing to make any GET or POST requests.
This lesson will provide code examples for both Node.js and Python, so regardless of the language you are using, you can follow along!
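As a small preview of where this is heading, below is a minimal sketch of the JavaScript client in action. The Actor name, input, and token are placeholders rather than part of this lesson, and the exact calls are explained further down:

```js
import { ApifyClient } from 'apify-client';

// Placeholder token - use your own from the Apify Console.
const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// Start an Actor and wait for its run to finish. The Actor name and input are placeholders.
const run = await client.actor('username/actor-name').call({ someInput: 'value' });

// Fetch the results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items.length);
```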
diff --git a/sources/academy/platform/getting_started/creating_actors.md b/sources/academy/platform/getting_started/creating_actors.md
index a1c6505b43..b7eea4ccb2 100644
--- a/sources/academy/platform/getting_started/creating_actors.md
+++ b/sources/academy/platform/getting_started/creating_actors.md
@@ -22,9 +22,9 @@ You'll be presented with a page featuring two ways to get started with a new Act
1. Creating an Actor from existing source code (using Git providers or pushing the code from your local machine using Apify CLI)
2. Creating an Actor from a code template
-| Existing source code | Code templates |
-|:---------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------:|
-|  |  |
+| Existing source code | Code templates |
+| :------------------------------------------------------------------------------: | :------------------------------------------------------------------------------: |
+|  |  |
## Creating an Actor from existing source code {#existing-source-code}
diff --git a/sources/academy/tutorials/api/index.md b/sources/academy/tutorials/api/index.md
index 4cd998bdd9..face662d28 100644
--- a/sources/academy/tutorials/api/index.md
+++ b/sources/academy/tutorials/api/index.md
@@ -14,4 +14,3 @@ This section explains how you can run [Apify Actors](/platform/actors) using Api
- [JavaScript](/api/client/js/)
- [Python](/api/client/python)
-
diff --git a/sources/academy/tutorials/api/retry_failed_requests.md b/sources/academy/tutorials/api/retry_failed_requests.md
index 90c25aed59..03bf616123 100644
--- a/sources/academy/tutorials/api/retry_failed_requests.md
+++ b/sources/academy/tutorials/api/retry_failed_requests.md
@@ -24,7 +24,7 @@ const REQUEST_QUEUE_ID = 'pFCvCasdvsyvyZdfD'; // Replace with your valid request
const allRequests = [];
let exclusiveStartId = null;
// List all requests from the queue, we have to do it in a loop because the request queue list is paginated
-for (; ;) {
+for (;;) {
const { items: requests } = await Actor.apifyClient
.requestQueue(REQUEST_QUEUE_ID)
.listRequests({ exclusiveStartId, limit: 1000 });
@@ -41,7 +41,9 @@ for (; ;) {
console.log(`Loaded ${allRequests.length} requests from the queue`);
// Now we filter the failed requests
-const failedRequests = allRequests.filter((request) => (request.errorMessages?.length || 0) > (request.retryCount || 0));
+const failedRequests = allRequests.filter(
+ (request) => (request.errorMessages?.length || 0) > (request.retryCount || 0),
+);
// We need to update them 1 by 1 to the pristine state
for (const request of failedRequests) {
diff --git a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md
index d852059967..18681c4b27 100644
--- a/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md
+++ b/sources/academy/tutorials/api/run_actor_and_retrieve_data_via_api.md
@@ -15,7 +15,6 @@ The most popular way of [integrating](https://help.apify.com/en/collections/1669
> Remember to check out our [API documentation](/api/v2) with examples in different languages and a live API console. We also recommend testing the API with a desktop client like [Postman](https://www.postman.com/) or [Insomnia](https://insomnia.rest).
-
Apify API offers two ways of interacting with it:
- [Synchronously](#synchronous-flow)
diff --git a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md
index 586a835b40..e1be05ce19 100644
--- a/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md
+++ b/sources/academy/tutorials/apify_scrapers/cheerio_scraper.md
@@ -7,7 +7,7 @@ sidebar_position: 3
slug: /apify-scrapers/cheerio-scraper
---
-[//]: # (TODO: Should be updated)
+[//]: # 'TODO: Should be updated'
#
@@ -34,18 +34,18 @@ Now that's out of the way, let's open one of the Actor detail pages in the Store
**Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)) page, and use our DevTools-Fu to scrape some data.
> If you're wondering why we're using Web Scraper as an example instead of Cheerio Scraper,
-it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!
+> it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!
## Building our Page function
Before we start, let's do a quick recap of the data we chose to scrape:
- 1. **URL** - The URL that goes directly to the Actor's detail page.
- 2. **Unique identifier** - Such as **apify/web-scraper**.
- 3. **Title** - The title visible in the Actor's detail page.
- 4. **Description** - The Actor's description.
- 5. **Last modification date** - When the Actor was last modified.
- 6. **Number of runs** - How many times the Actor was run.
+1. **URL** - The URL that goes directly to the Actor's detail page.
+2. **Unique identifier** - Such as **apify/web-scraper**.
+3. **Title** - The title visible in the Actor's detail page.
+4. **Description** - The Actor's description.
+5. **Last modification date** - When the Actor was last modified.
+6. **Number of runs** - How many times the Actor was run.

@@ -110,11 +110,7 @@ async function pageFunction(context) {
return {
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
};
}
```
@@ -137,11 +133,7 @@ async function pageFunction(context) {
return {
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
runCount: Number(
$('ul.ActorHeader-stats > li:nth-of-type(3)')
.text()
@@ -175,21 +167,14 @@ async function pageFunction(context) {
const { url } = request;
// ... rest of your code can come here
- const uniqueIdentifier = url
- .split('/')
- .slice(-2)
- .join('/');
+ const uniqueIdentifier = url.split('/').slice(-2).join('/');
return {
url,
uniqueIdentifier,
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
runCount: Number(
$('ul.ActorHeader-stats > li:nth-of-type(3)')
.text()
@@ -216,21 +201,14 @@ async function pageFunction(context) {
await skipLinks();
// Do some scraping.
- const uniqueIdentifier = url
- .split('/')
- .slice(-2)
- .join('/');
+ const uniqueIdentifier = url.split('/').slice(-2).join('/');
return {
url,
uniqueIdentifier,
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
runCount: Number(
$('ul.ActorHeader-stats > li:nth-of-type(3)')
.text()
@@ -255,9 +233,9 @@ actually scrape all the Actors, just the first page of results. That's because t
one needs to click the **Show more** button at the very bottom of the list. This is pagination.
> This is a typical JavaScript pagination, sometimes called infinite scroll. Other pages may use links
-that take you to the next page. If you encounter those, make a Pseudo URL for those links and they
-will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL
-it's processing.
+> that take you to the next page. If you encounter those, make a Pseudo URL for those links and they
+> will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL
+> it's processing.
If you paid close attention, you may now see a problem. How do we click a button in the page when we're working
with Cheerio? We don't have a browser to do it and we only have the HTML of the page to work with. The simple
@@ -305,9 +283,9 @@ we need is there, in the `data.props.pageProps.items` array. Great!

> It's obvious that all the information we set to scrape is available in this one data object,
-so you might already be wondering, can I make one request to the store to get this JSON
-and then parse it out and be done with it in a single request? Yes you can! And that's the power
-of clever page analysis.
+> so you might already be wondering, can I make one request to the store to get this JSON
+> and then parse it out and be done with it in a single request? Yes you can! And that's the power
+> of clever page analysis.
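If you want to experiment with that idea, a sketch could look something like the code below. It assumes the Store page embeds its data as Next.js JSON in a `<script id="__NEXT_DATA__">` element, which matches the `data.props.pageProps.items` path mentioned earlier but is an assumption about the page, not something guaranteed by this tutorial:

```js
async function pageFunction(context) {
    const { $ } = context;
    // Assumption: the embedded JSON lives in this script tag.
    const data = JSON.parse($('#__NEXT_DATA__').html());
    // Return the raw items; each array element becomes one dataset item.
    return data.props.pageProps.items;
}
```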
### Using the data to enqueue all Actor details
@@ -339,8 +317,8 @@ We iterate through the items we found, build Actor detail URLs from the availabl
those URLs into the request queue. We need to specify the label too, otherwise our page function wouldn't know
how to route those requests.
->If you're wondering how we know the structure of the URL, see the [Getting started
-with Apify Scrapers](./getting_started.md) tutorial again.
+> If you're wondering how we know the structure of the URL, see the [Getting started
+> with Apify Scrapers](./getting_started.md) tutorial again.
### Plugging it into the Page function
@@ -374,21 +352,14 @@ async function pageFunction(context) {
await skipLinks();
// Do some scraping.
- const uniqueIdentifier = url
- .split('/')
- .slice(-2)
- .join('/');
+ const uniqueIdentifier = url.split('/').slice(-2).join('/');
return {
url,
uniqueIdentifier,
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
runCount: Number(
$('ul.ActorHeader-stats > li:nth-of-type(3)')
.text()
@@ -406,10 +377,10 @@ You should have a table of all the Actor's details in front of you. If you do, g
scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo.
> There's an important caveat. The way we implemented pagination here is in no way a generic system that you can
-use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way
-to get all results with Cheerio only and other times it takes hours of research. Keep this in mind when choosing
-the right scraper for your job. But don't get discouraged. Often times, the only thing you will ever need is to
-define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper.
+> use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way
+> to get all results with Cheerio only and other times it takes hours of research. Keep this in mind when choosing
+> the right scraper for your job. But don't get discouraged. Oftentimes, the only thing you will ever need is to
+> define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper.
## Downloading the scraped data
@@ -436,9 +407,12 @@ that encapsulate all the different logic. You can, for example, define a functio
```js
async function pageFunction(context) {
switch (context.request.userData.label) {
- case 'START': return handleStart(context);
- case 'DETAIL': return handleDetail(context);
- default: throw new Error('Unknown request label.');
+ case 'START':
+ return handleStart(context);
+ case 'DETAIL':
+ return handleDetail(context);
+ default:
+ throw new Error('Unknown request label.');
}
async function handleStart({ log, waitFor, $ }) {
@@ -466,21 +440,14 @@ async function pageFunction(context) {
await skipLinks();
// Do some scraping.
- const uniqueIdentifier = url
- .split('/')
- .slice(-2)
- .join('/');
+ const uniqueIdentifier = url.split('/').slice(-2).join('/');
return {
url,
uniqueIdentifier,
title: $('header h1').text(),
description: $('header span.actor-description').text(),
- modifiedDate: new Date(
- Number(
- $('ul.ActorHeader-stats time').attr('datetime'),
- ),
- ),
+ modifiedDate: new Date(Number($('ul.ActorHeader-stats time').attr('datetime'))),
runCount: Number(
$('ul.ActorHeader-stats > li:nth-of-type(3)')
.text()
@@ -493,7 +460,7 @@ async function pageFunction(context) {
```
> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature
-of JavaScript. It helps you put what matters on top, if you so desire.
+> of JavaScript. It helps you put what matters on top, if you so desire.
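If you'd like to see hoisting in isolation, here's a tiny standalone example that is not part of the scraper code:

```js
// This call works even though the function is declared below it,
// because function declarations are hoisted to the top of their scope.
printGreeting();

function printGreeting() {
    console.log('Hello from a hoisted function!');
}
```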
## Final word
@@ -501,10 +468,9 @@ Thank you for reading this whole tutorial! Really! It's important to us that our
## What's next
-* Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
-* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
-* Found out you're not into the coding part but would still to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/contact-sales) from an Apify-certified developer.
-
+- Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
+- [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
+- Found out you're not into the coding part but would still like to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/contact-sales) from an Apify-certified developer.
**Learn how to scrape a website using Apify's Cheerio Scraper. Build an Actor's page function, extract information from a web page and download your data.**
diff --git a/sources/academy/tutorials/apify_scrapers/getting_started.md b/sources/academy/tutorials/apify_scrapers/getting_started.md
index 9b05130eba..9ef63ae65c 100644
--- a/sources/academy/tutorials/apify_scrapers/getting_started.md
+++ b/sources/academy/tutorials/apify_scrapers/getting_started.md
@@ -7,7 +7,7 @@ sidebar_position: 1
slug: /apify-scrapers/getting-started
---
-[//]: # (TODO: Should be updated)
+[//]: # 'TODO: Should be updated'
#
@@ -37,7 +37,7 @@ Scroll down to the **Performance and limits** section and set the **Max pages pe
> This also helps with keeping your [compute unit](/platform/actors/running/usage-and-resources) (CU) consumption low. To get an idea, our free plan includes 10 CUs and this run will consume about 0.04 CU, so you can run it 250 times a month for free. If you accidentally go over the limit, no worries, we won't charge you for it. You just won't be able to run more tasks that month.
-Now click **Save & Run**! *(in the bottom-left part of your screen)*
+Now click **Save & Run**! _(in the bottom-left part of your screen)_
### The run detail
@@ -97,12 +97,12 @@ Since this is a tutorial, we'll be scraping our own website. [Apify Store](https
We want to create a scraper that scrapes all the Actors in the store and collects the following attributes for each Actor:
- 1. **URL** - The URL that goes directly to the Actor's detail page.
- 2. **Unique identifier** - Such as **apify/web-scraper**.
- 3. **Title** - The title visible in the Actor's detail page.
- 4. **Description** - The Actor's description.
- 5. **Last modification date** - When the Actor was last modified.
- 6. **Number of runs** - How many times the Actor was run.
+1. **URL** - The URL that goes directly to the Actor's detail page.
+2. **Unique identifier** - Such as **apify/web-scraper**.
+3. **Title** - The title visible in the Actor's detail page.
+4. **Description** - The Actor's description.
+5. **Last modification date** - When the Actor was last modified.
+6. **Number of runs** - How many times the Actor was run.
Some of this information may be scraped directly from the listing pages, but for the rest, we will need to visit the detail pages of all the Actors.
@@ -120,7 +120,7 @@ We also need to somehow distinguish the **Start URL** from all the other URLs th
```json
{
- "label": "START"
+ "label": "START"
}
```
@@ -184,7 +184,7 @@ Let's use the above **Pseudo URL** in our task. We should also add a label as we
```json
{
- "label": "DETAIL"
+ "label": "DETAIL"
}
```
@@ -268,8 +268,8 @@ async function pageFunction(context) {
will produce the following table:
-| title | url |
-| ----- | --- |
+| title | url |
+| ---------------------------------------------------- | ----------------- |
| Web Scraping, Data Extraction and Automation - Apify | https://apify.com |
## Scraper lifecycle
@@ -279,12 +279,12 @@ or in other words, what the scraper actually does when it scrapes. It's quite st
The scraper:
- 1. Visits the first **Start URL** and waits for the page to load.
- 2. Executes the `pageFunction`.
- 3. Finds all the elements matching the **Link selector** and extracts their `href` attributes (URLs).
- 4. Uses the **pseudo URLs** to filter the extracted URLs and throws away those that don't match.
- 5. Enqueues the matching URLs to the end of the crawling queue.
- 6. Closes the page and selects a new URL to visit, either from the **Start URL**s if there are any left, or from the beginning of the crawling queue.
+1. Visits the first **Start URL** and waits for the page to load.
+2. Executes the `pageFunction`.
+3. Finds all the elements matching the **Link selector** and extracts their `href` attributes (URLs).
+4. Uses the **pseudo URLs** to filter the extracted URLs and throws away those that don't match.
+5. Enqueues the matching URLs to the end of the crawling queue.
+6. Closes the page and selects a new URL to visit, either from the **Start URL**s if there are any left, or from the beginning of the crawling queue.
> When you're not using the request queue, the scraper repeats steps 1 and 2. You would not use the request queue when you already know all the URLs you want to visit. For example, when you have a pre-existing list of a thousand URLs that you uploaded as a text file. Or when scraping a single URL.
@@ -314,10 +314,7 @@ async function pageFunction(context) {
await skipLinks();
// Do some scraping.
- const uniqueIdentifier = url
- .split('/')
- .slice(-2)
- .join('/');
+ const uniqueIdentifier = url.split('/').slice(-2).join('/');
return {
url,
diff --git a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md
index 130713691a..312d48ed5c 100644
--- a/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md
+++ b/sources/academy/tutorials/apify_scrapers/puppeteer_scraper.md
@@ -7,7 +7,7 @@ sidebar_position: 4
slug: /apify-scrapers/puppeteer-scraper
---
-[//]: # (TODO: Should be updated)
+[//]: # 'TODO: Should be updated'
#
@@ -30,8 +30,8 @@ so that you can start using it for some of the most typical scraping tasks, but
you'll need to visit its [documentation](https://pptr.dev/) and really dive deep into its intricacies.
> The purpose of Puppeteer Scraper is to remove some of the difficulty faced when using Puppeteer by wrapping
-it in a nice, manageable UI. It provides almost all of its features in a format that is much easier to grasp
-when first trying to scrape using Puppeteer.
+> it in a nice, manageable UI. It provides almost all of its features in a format that is much easier to grasp
+> when first trying to scrape using Puppeteer.
### Web Scraper differences
@@ -49,18 +49,18 @@ Puppeteer Scraper is powerful (and the [Apify SDK](https://sdk.apify.com) is sup
Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper page and use our DevTools-Fu to scrape some data.
> If you're wondering why we're using Web Scraper as an example instead of Puppeteer Scraper,
-it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!
+> it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!
## Building our Page function
Before we start, let's do a quick recap of the data we chose to scrape:
- 1. **URL** - The URL that goes directly to the Actor's detail page.
- 2. **Unique identifier** - Such as **apify/web-scraper**.
- 3. **Title** - The title visible in the Actor's detail page.
- 4. **Description** - The Actor's description.
- 5. **Last modification date** - When the Actor was last modified.
- 6. **Number of runs** - How many times the Actor was run.
+1. **URL** - The URL that goes directly to the Actor's detail page.
+2. **Unique identifier** - Such as **apify/web-scraper**.
+3. **Title** - The title visible in the Actor's detail page.
+4. **Description** - The Actor's description.
+5. **Last modification date** - When the Actor was last modified.
+6. **Number of runs** - How many times the Actor was run.

@@ -87,10 +87,7 @@ And as we already know, there's only one.
// Using Puppeteer
async function pageFunction(context) {
const { page } = context;
- const title = await page.$eval(
- 'header h1',
- ((el) => el.textContent),
- );
+ const title = await page.$eval('header h1', (el) => el.textContent);
return {
title,
@@ -113,14 +110,8 @@ the `<header>` element too, same as the title. Moreover, the actual description
```js
async function pageFunction(context) {
const { page } = context;
- const title = await page.$eval(
- 'header h1',
- ((el) => el.textContent),
- );
- const description = await page.$eval(
- 'header span.actor-description',
- ((el) => el.textContent),
- );
+ const title = await page.$eval('header h1', (el) => el.textContent);
+ const description = await page.$eval('header span.actor-description', (el) => el.textContent);
return {
title,
@@ -138,18 +129,11 @@ The DevTools tell us that the `modifiedDate` can be found in a `<time>`
+
Subtitle
+
Paragraph
+
+
+ Heading
+
```
@@ -99,10 +97,10 @@ You can combine selectors to narrow results. For example, `p.lead` matches `p` e
```html
-
-
Lead paragraph.
-
Paragraph
-
Paragraph
+
+
Lead paragraph.
+
Paragraph
+
Paragraph
```
@@ -112,7 +110,7 @@ How did we know `.product-item` selects a product card? By inspecting the markup
Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We'd better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
-The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
+The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card _is_ a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
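A quick way to verify a candidate selector is to try it directly in the DevTools **Console**; the exact count naturally depends on what the listing shows at the time:

```js
// Run on the product listing page: count the elements matched by the candidate selector.
document.querySelectorAll('.product-item').length;
```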
@@ -158,12 +156,12 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
Solution
- 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on several headings to examine the markup.
- 1. Notice that all headings are `h2` elements with the `mp-h2` class.
- 1. In the **Console**, execute `document.querySelectorAll('h2')`.
- 1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
+1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
+1. Activate the element selection tool in your DevTools.
+1. Click on several headings to examine the markup.
+1. Notice that all headings are `h2` elements with the `mp-h2` class.
+1. In the **Console**, execute `document.querySelectorAll('h2')`.
+1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
@@ -176,13 +174,13 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
Solution
- 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the first product to inspect its markup. Repeat with a few others.
- 1. Observe that all products are `section` elements with multiple classes, including `product-card`.
- 1. Since `section` is a generic wrapper, focus on the `product-card` class.
- 1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
- 1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
+1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
+1. Activate the element selection tool in your DevTools.
+1. Click on the first product to inspect its markup. Repeat with a few others.
+1. Observe that all products are `section` elements with multiple classes, including `product-card`.
+1. Since `section` is a generic wrapper, focus on the `product-card` class.
+1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
+1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
@@ -201,13 +199,13 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs
Solution
- 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
- 1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
- 1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
- 1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
- 1. In the **Console**, execute `document.querySelectorAll('main li')`.
- 1. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works!
+1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
+1. Activate the element selection tool in your DevTools.
+1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
+1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
+1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
+1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
+1. In the **Console**, execute `document.querySelectorAll('main li')`.
+1. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works!
diff --git a/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md
index e774b7e1d7..43db1012b9 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md
@@ -1,12 +1,12 @@
---
title: Extracting data from a web page with browser DevTools
-sidebar_label: "DevTools: Extracting data"
+sidebar_label: 'DevTools: Extracting data'
description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
slug: /scraping-basics-javascript/devtools-extracting-data
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -86,15 +86,15 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
Solution
- 1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
- 1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the price of the first and most expensive plant.
- 1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
- 1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
- 1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
- 1. Convert the price text into a number by executing `parseInt(price.textContent)`.
- 1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
+1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
+1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
+1. Activate the element selection tool in your DevTools.
+1. Click on the price of the first and most expensive plant.
+1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
+1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
+1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
+1. Convert the price text into a number by executing `parseInt(price.textContent)`.
+1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
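The individual Console commands from the solution can also be condensed into one small snippet, assuming the listing is already sorted from the most expensive item:

```js
// The first matching price element belongs to the most expensive plant.
const price = document.querySelector('.plp-price__integer');
console.log(parseInt(price.textContent)); // 699 at the time of writing
```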
@@ -107,13 +107,13 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto
Solution
- 1. Open the [Movies page](https://www.fandom.com/topics/movies).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the list item for the top Fandom wiki in the category.
- 1. Notice that it has a class `topic_explore-wikis__link`.
- 1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done.
- 1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`.
- 1. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`.
+1. Open the [Movies page](https://www.fandom.com/topics/movies).
+1. Activate the element selection tool in your DevTools.
+1. Click on the list item for the top Fandom wiki in the category.
+1. Notice that it has a class `topic_explore-wikis__link`.
+1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done.
+1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`.
+1. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`.
@@ -126,13 +126,13 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
Solution
- 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the first post.
- 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
- 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
- 1. Extract the post's title by executing `post.querySelector('h3').textContent`.
- 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
- 1. Extract the photo URL by executing `post.querySelector('img').src`.
+1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
+1. Activate the element selection tool in your DevTools.
+1. Click on the first post.
+1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
+1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
+1. Extract the post's title by executing `post.querySelector('h3').textContent`.
+1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
+1. Extract the photo URL by executing `post.querySelector('img').src`.
diff --git a/sources/academy/webscraping/scraping_basics_javascript/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_javascript/04_downloading_html.md
index dd5ebfb5b0..eb63ff4e8f 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/04_downloading_html.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/04_downloading_html.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/downloading-html
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -94,7 +94,7 @@ SyntaxError: Cannot use import statement outside a module
Now onto coding! Let's change our code so it downloads the HTML of the product listing instead of printing `All is OK`. The [documentation of the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch) provides us with examples of how to use it. Inspired by those, our code will look like this:
```js
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
console.log(await response.text());
```
@@ -155,13 +155,13 @@ https://warehouse-theme-metal.myshopify.com/does/not/exist
We could check the value of `response.status` against a list of allowed numbers, but the Fetch API already provides `response.ok`, a property which returns `false` if our request wasn't successful:
```js
-const url = "https://warehouse-theme-metal.myshopify.com/does/not/exist";
+const url = 'https://warehouse-theme-metal.myshopify.com/does/not/exist';
const response = await fetch(url);
if (response.ok) {
- console.log(await response.text());
+ console.log(await response.text());
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -195,16 +195,16 @@ https://www.aliexpress.com/w/wholesale-darth-vader.html
Solution
- ```js
- const url = "https://www.aliexpress.com/w/wholesale-darth-vader.html";
- const response = await fetch(url);
+```js
+const url = 'https://www.aliexpress.com/w/wholesale-darth-vader.html';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
console.log(await response.text());
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
@@ -219,27 +219,27 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
Solution
- Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
+Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
- ```text
- node index.js > products.html
- ```
+```text
+node index.js > products.html
+```
- If you want to use Node.js instead, it offers several ways how to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
+If you want to use Node.js instead, it offers several ways to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
- ```js
- import { writeFile } from 'node:fs/promises';
+```js
+import { writeFile } from 'node:fs/promises';
- const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
- const response = await fetch(url);
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
await writeFile('products.html', html);
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
@@ -254,20 +254,21 @@ https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72
Solution
- Node.js offers several ways how to create files. The solution below uses [Promises API](https://nodejs.org/api/fs.html#promises-api):
+Node.js offers several ways to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api):
- ```js
- import { writeFile } from 'node:fs/promises';
+```js
+import { writeFile } from 'node:fs/promises';
- const url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg";
- const response = await fetch(url);
+const url =
+ 'https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const buffer = Buffer.from(await response.arrayBuffer());
await writeFile('tv.jpg', buffer);
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_javascript/05_parsing_html.md
index 78604a16fa..620bd20c5b 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/05_parsing_html.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/parsing-html
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -59,15 +59,15 @@ We'll update our code to the following:
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
- console.log($("h1"));
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ console.log($('h1'));
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -104,16 +104,16 @@ The item has many properties, such as references to its parent or sibling elemen
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
- // highlight-next-line
- console.log($("h1").text());
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-next-line
+ console.log($('h1').text());
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -139,16 +139,16 @@ Scanning through [usage examples](https://cheerio.js.org/docs/basics/selecting)
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
- // highlight-next-line
- console.log($(".product-item").length);
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-next-line
+ console.log($('.product-item').length);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -184,20 +184,20 @@ https://www.f1academy.com/Racing-Series/Teams
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://www.f1academy.com/Racing-Series/Teams";
- const response = await fetch(url);
+const url = 'https://www.f1academy.com/Racing-Series/Teams';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- console.log($(".teams-driver-item").length);
- } else {
+ console.log($('.teams-driver-item').length);
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
@@ -208,19 +208,19 @@ Use the same URL as in the previous exercise, but this time print a total count
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://www.f1academy.com/Racing-Series/Teams";
- const response = await fetch(url);
+const url = 'https://www.f1academy.com/Racing-Series/Teams';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- console.log($(".driver").length);
- } else {
+ console.log($('.driver').length);
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md
index 8193597053..fd9af9f498 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/locating-elements
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -19,19 +19,19 @@ In the previous lesson we've managed to print text of the page's main heading or
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
- // highlight-start
- for (const element of $(".product-item").toArray()) {
- console.log($(element).text());
- }
- // highlight-end
+ const html = await response.text();
+ const $ = cheerio.load(html);
+ // highlight-start
+ for (const element of $('.product-item').toArray()) {
+ console.log($(element).text());
+ }
+ // highlight-end
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -73,26 +73,26 @@ We should be looking for elements which have the `product-item__title` and `pric
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
+ const html = await response.text();
+ const $ = cheerio.load(html);
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
- const $title = $productItem.find(".product-item__title");
- const title = $title.text();
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text();
- const $price = $productItem.find(".price");
- const price = $price.text();
+ const $price = $productItem.find('.price');
+ const price = $price.text();
- console.log(`${title} | ${price}`);
- }
+ console.log(`${title} | ${price}`);
+ }
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -121,8 +121,8 @@ In the output we can see that the price isn't located precisely. For each produc
```html
- Sale price
- $74.95
+ Sale price
+ $74.95
```
@@ -169,27 +169,27 @@ It seems like we can read the last element to get the actual amount. Let's fix o
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
+ const html = await response.text();
+ const $ = cheerio.load(html);
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
- const $title = $productItem.find(".product-item__title");
- const title = $title.text();
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text();
- // highlight-next-line
- const $price = $productItem.find(".price").contents().last();
- const price = $price.text();
+ // highlight-next-line
+ const $price = $productItem.find('.price').contents().last();
+ const price = $price.text();
- console.log(`${title} | ${price}`);
- }
+ console.log(`${title} | ${price}`);
+ }
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -239,37 +239,38 @@ Djibouti
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
- const response = await fetch(url);
+const url =
+ 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const tableElement of $(".wikitable").toArray()) {
- const $table = $(tableElement);
- const $rows = $table.find("tr");
+ for (const tableElement of $('.wikitable').toArray()) {
+ const $table = $(tableElement);
+ const $rows = $table.find('tr');
- for (const rowElement of $rows.toArray()) {
- const $row = $(rowElement);
- const $cells = $row.find("td");
+ for (const rowElement of $rows.toArray()) {
+ const $row = $(rowElement);
+ const $cells = $row.find('td');
- if ($cells.length > 0) {
- const $thirdColumn = $($cells[2]);
- const $link = $thirdColumn.find("a").first();
- console.log($link.text());
+ if ($cells.length > 0) {
+ const $thirdColumn = $($cells[2]);
+ const $link = $thirdColumn.find('a').first();
+ console.log($link.text());
+ }
}
- }
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
- Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `table_row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
+Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `$row.find('td')` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
@@ -289,25 +290,26 @@ You may want to check out the following pages:
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
- const response = await fetch(url);
+const url =
+ 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $(".wikitable tr td:nth-child(3)").toArray()) {
- const $nameCell = $(element);
- const $link = $nameCell.find("a").first();
- console.log($link.text());
+ for (const element of $('.wikitable tr td:nth-child(3)').toArray()) {
+ const $nameCell = $(element);
+ const $link = $nameCell.find('a').first();
+ console.log($link.text());
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
@@ -331,22 +333,22 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://www.theguardian.com/sport/formulaone";
- const response = await fetch(url);
+const url = 'https://www.theguardian.com/sport/formulaone';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $("#maincontent ul li h3").toArray()) {
- console.log($(element).text());
+ for (const element of $('#maincontent ul li h3').toArray()) {
+ console.log($(element).text());
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md
index dae6bda605..f542658317 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/extracting-data
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -38,16 +38,16 @@ It's because some products have variants with different prices. Later in the cou
Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?
```js
-const priceText = $price.text().replace("From ", "");
+const priceText = $price.text().replace('From ', '');
```
In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:
```js
const priceRange = { minPrice: null, price: null };
-const priceText = $price.text()
-if (priceText.startsWith("From ")) {
- priceRange.minPrice = priceText.replace("From ", "");
+const priceText = $price.text();
+if (priceText.startsWith('From ')) {
+ priceRange.minPrice = priceText.replace('From ', '');
} else {
priceRange.minPrice = priceText;
priceRange.price = priceRange.minPrice;
@@ -65,39 +65,39 @@ The whole program would look like this:
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
-
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
-
- const $title = $productItem.find(".product-item__title");
- const title = $title.text();
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price.text();
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = priceText.replace("From ", "");
- } else {
- priceRange.minPrice = priceText;
- priceRange.price = priceRange.minPrice;
- }
+ const html = await response.text();
+ const $ = cheerio.load(html);
+
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
+
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text();
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = priceText.replace('From ', '');
+ } else {
+ priceRange.minPrice = priceText;
+ priceRange.price = priceRange.minPrice;
+ }
- console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
- }
+ console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
+ }
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
## Removing white space
-Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
+Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
We call the operation of removing whitespace _trimming_ or _stripping_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add JavaScript's built-in [.trim()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim):
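For illustration, here's a minimal sketch of what trimming does to a string padded the way HTML indentation typically pads it (the padded value below is made up for the example):

```js
// A string padded with the newlines and spaces that HTML indentation typically adds.
const padded = '\n      JBL Flip 4 Waterproof Portable Bluetooth Speaker\n    ';

// .trim() removes the leading and trailing whitespace, leaving the text itself.
console.log(padded.trim()); // 'JBL Flip 4 Waterproof Portable Bluetooth Speaker'
```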
@@ -126,11 +126,7 @@ The demonstration above is inside the Node.js' [interactive REPL](https://nodejs
We need to remove the dollar sign and the decimal commas. For this type of cleaning, [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) are often the best tool for the job, but in this case [`.replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) is also sufficient:
```js
-const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(",", "");
+const priceText = $price.text().trim().replace('$', '').replace(',', '');
```
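If we did reach for regular expressions here, a rough equivalent of the chained `.replace()` calls could look like this (just a sketch reusing the `$price` handle from above, not used in the rest of the lesson):

```js
// The character class [$,] matches either a dollar sign or a comma;
// the g flag removes every occurrence, not just the first one.
const priceText = $price.text().trim().replace(/[$,]/g, '');
```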
## Representing money in programs
@@ -139,9 +135,9 @@ Now we should be able to add `parseFloat()`, so that we have the prices not as a
```js
const priceRange = { minPrice: null, price: null };
-const priceText = $price.text()
-if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseFloat(priceText.replace("From ", ""));
+const priceText = $price.text();
+if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseFloat(priceText.replace('From ', ''));
} else {
priceRange.minPrice = parseFloat(priceText);
priceRange.price = priceRange.minPrice;
@@ -159,12 +155,12 @@ These errors are small and usually don't matter, but sometimes they can add up a
```js
const priceText = $price
- .text()
- .trim()
- .replace("$", "")
-// highlight-next-line
- .replace(".", "")
- .replace(",", "");
+ .text()
+ .trim()
+ .replace('$', '')
+ // highlight-next-line
+ .replace('.', '')
+ .replace(',', '');
```
In this case, removing the dot from the price text is the same as if we multiplied all the numbers by 100, effectively converting dollars to cents. This is what the whole program looks like now:
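As a quick sanity check of that idea, here's what happens to one of the listing's prices (a standalone sketch, not part of the scraper):

```js
// '$1,398.00' with $, . and , stripped becomes '139800',
// which parseInt() turns into an integer amount of cents.
const cents = parseInt('$1,398.00'.replace('$', '').replace('.', '').replace(',', ''));
console.log(cents); // 139800, i.e. $1,398.00 expressed in cents
```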
@@ -172,39 +168,34 @@ In this case, removing the dot from the price text is the same as if we multipli
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
-
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
-
- const $title = $productItem.find(".product-item__title");
- const titleText = $title.text().trim();
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
+ const html = await response.text();
+ const $ = cheerio.load(html);
- console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
- }
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
+
+ const $title = $productItem.find('.product-item__title');
+    const title = $title.text().trim();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text().trim().replace('$', '').replace('.', '').replace(',', '');
+
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ console.log(`${title} | ${priceRange.minPrice} | ${priceRange.price}`);
+ }
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -240,47 +231,47 @@ Denon AH-C720 In-Ear Headphones | 236
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- function parseUnitsText(text) {
+function parseUnitsText(text) {
const count = text
- .replace("In stock,", "")
- .replace("Only", "")
- .replace(" left", "")
- .replace("units", "")
- .trim();
- return count === "Sold out" ? 0 : parseInt(count);
- }
-
- const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
- const response = await fetch(url);
-
- if (response.ok) {
+ .replace('In stock,', '')
+ .replace('Only', '')
+ .replace(' left', '')
+ .replace('units', '')
+ .trim();
+ return count === 'Sold out' ? 0 : parseInt(count);
+}
+
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
+const response = await fetch(url);
+
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
- const title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
+    const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
- const unitsText = $productItem.find(".product-item__inventory").text();
- const unitsCount = parseUnitsText(unitsText);
+ const unitsText = $productItem.find('.product-item__inventory').text();
+ const unitsCount = parseUnitsText(unitsText);
- console.log(`${title} | ${unitsCount}`);
+ console.log(`${title} | ${unitsCount}`);
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
- :::tip Conditional (ternary) operator
+:::tip Conditional (ternary) operator
- For brevity, the solution uses the [conditional (ternary) operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_operator). You can achieve the same with a plain `if` and `else` block.
+For brevity, the solution uses the [conditional (ternary) operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_operator). You can achieve the same with a plain `if` and `else` block.
- :::
+:::
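For comparison, `parseUnitsText()` rewritten with a plain `if` and `else` would look roughly like this:

```js
// Same behavior as the ternary version above, written out with if/else.
function parseUnitsText(text) {
  const count = text
    .replace('In stock,', '')
    .replace('Only', '')
    .replace(' left', '')
    .replace('units', '')
    .trim();
  if (count === 'Sold out') {
    return 0;
  }
  return parseInt(count);
}
```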
@@ -291,45 +282,45 @@ Simplify the code from previous exercise. Use [regular expressions](https://deve
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- function parseUnitsText(text) {
+function parseUnitsText(text) {
const match = text.match(/\d+/);
if (match) {
- return parseInt(match[0]);
+ return parseInt(match[0]);
}
return 0;
- }
+}
- const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
- const response = await fetch(url);
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $(".product-item").toArray()) {
- const $productItem = $(element);
+ for (const element of $('.product-item').toArray()) {
+ const $productItem = $(element);
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
- const unitsText = $productItem.find(".product-item__inventory").text();
- const unitsCount = parseUnitsText(unitsText);
+ const unitsText = $productItem.find('.product-item__inventory').text();
+ const unitsCount = parseUnitsText(unitsText);
- console.log(`${title} | ${unitsCount}`);
+ console.log(`${title} | ${unitsCount}`);
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
- :::tip Conditional (ternary) operator
+:::tip Conditional (ternary) operator
- For brevity, the solution uses the [conditional (ternary) operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_operator). You can achieve the same with a plain `if` and `else` block.
+For brevity, the solution uses the [conditional (ternary) operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_operator). You can achieve the same with a plain `if` and `else` block.
- :::
+:::
@@ -363,34 +354,28 @@ Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const url = "https://www.theguardian.com/sport/formulaone";
- const response = await fetch(url);
+const url = 'https://www.theguardian.com/sport/formulaone';
+const response = await fetch(url);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $("#maincontent ul li").toArray()) {
- const $article = $(element);
+ for (const element of $('#maincontent ul li').toArray()) {
+ const $article = $(element);
- const title = $article
- .find("h3")
- .text()
- .trim();
- const dateText = $article
- .find("time")
- .attr("datetime")
- .trim();
- const date = new Date(dateText);
+ const title = $article.find('h3').text().trim();
+ const dateText = $article.find('time').attr('datetime').trim();
+ const date = new Date(dateText);
- console.log(`${title} | ${date.toDateString()}`);
+ console.log(`${title} | ${date.toDateString()}`);
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/08_saving_data.md b/sources/academy/webscraping/scraping_basics_javascript/08_saving_data.md
index bd960e9b5c..8d1e48c4ec 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/08_saving_data.md
@@ -33,43 +33,45 @@ Producing results line by line is an efficient approach to handling large datase
```js
import * as cheerio from 'cheerio';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
-
- // highlight-next-line
- const data = $(".product-item").toArray().map(element => {
- const $productItem = $(element);
-
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
+ const html = await response.text();
+ const $ = cheerio.load(html);
// highlight-next-line
- return { title, ...priceRange };
- });
- // highlight-next-line
- console.log(data);
+ const data = $('.product-item')
+ .toArray()
+ .map((element) => {
+ const $productItem = $(element);
+
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace('$', '')
+ .replace('.', '')
+ .replace(',', '');
+
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ // highlight-next-line
+ return { title, ...priceRange };
+ });
+ // highlight-next-line
+ console.log(data);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -117,7 +119,7 @@ We'll begin with importing the `writeFile` function from the Node.js standard li
```js
import * as cheerio from 'cheerio';
// highlight-next-line
-import { writeFile } from "fs/promises";
+import { writeFile } from 'fs/promises';
```
Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `console.log(data)` with the following:
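Judging from the later versions of the program in this course, that replacement boils down to serializing the array and writing it to a file, roughly:

```js
// Serialize the scraped items to a JSON string and write it to products.json.
const jsonData = JSON.stringify(data);
await writeFile('products.json', jsonData);
```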
@@ -130,6 +132,7 @@ await writeFile('products.json', jsonData);
That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
+
```json title=products.json
[{"title":"JBL Flip 4 Waterproof Portable Bluetooth Speaker","minPrice":7495,"price":7495},{"title":"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV","minPrice":139800,"price":null},...]
```
@@ -137,7 +140,11 @@ That's it! If we run our scraper now, it won't display any output, but it will c
If you skim through the data, you'll notice that the `JSON.stringify()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
```json
-{"title":"Sony SACS9 10\" Active Subwoofer","minPrice":15800,"price":15800}
+{
+ "title": "Sony SACS9 10\" Active Subwoofer",
+ "minPrice": 15800,
+ "price": 15800
+}
```
:::tip Pretty JSON
@@ -163,7 +170,7 @@ Once installed, we can add the following line to our imports:
```js
import * as cheerio from 'cheerio';
-import { writeFile } from "fs/promises";
+import { writeFile } from 'fs/promises';
// highlight-next-line
import { AsyncParser } from '@json2csv/node';
```
@@ -176,7 +183,7 @@ await writeFile('products.json', jsonData);
const parser = new AsyncParser();
const csvData = await parser.parse(data).promise();
-await writeFile("products.csv", csvData);
+await writeFile('products.csv', csvData);
```
The program should now also produce a `products.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
@@ -210,15 +217,13 @@ Write a new Node.js program that reads the `products.json` file we created in th
Solution
- ```js
- import { readFile } from "fs/promises";
+```js
+import { readFile } from 'fs/promises';
- const jsonData = await readFile("products.json");
- const data = JSON.parse(jsonData);
- data
- .filter(row => row.minPrice > 50000)
- .forEach(row => console.log(row));
- ```
+const jsonData = await readFile('products.json');
+const data = JSON.parse(jsonData);
+data.filter((row) => row.minPrice > 50000).forEach((row) => console.log(row));
+```
@@ -229,12 +234,12 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
Solution
- Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
- 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
- 1. Select the header row. Go to **Data > Create filter**.
- 1. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
+1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+1. Select the header row. Go to **Data > Create filter**.
+1. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
- 
+
diff --git a/sources/academy/webscraping/scraping_basics_javascript/09_getting_links.md b/sources/academy/webscraping/scraping_basics_javascript/09_getting_links.md
index e923a8875d..68559ac8cd 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/09_getting_links.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/getting-links
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
@@ -38,46 +38,48 @@ import * as cheerio from 'cheerio';
import { writeFile } from 'fs/promises';
import { AsyncParser } from '@json2csv/node';
-const url = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const url = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- const $ = cheerio.load(html);
-
- const data = $(".product-item").toArray().map(element => {
- const $productItem = $(element);
-
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
-
- return { title, ...priceRange };
- });
-
- const jsonData = JSON.stringify(data);
- await writeFile('products.json', jsonData);
+ const html = await response.text();
+ const $ = cheerio.load(html);
- const parser = new AsyncParser();
- const csvData = await parser.parse(data).promise();
- await writeFile('products.csv', csvData);
+ const data = $('.product-item')
+ .toArray()
+ .map((element) => {
+ const $productItem = $(element);
+
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price
+ .text()
+ .trim()
+ .replace('$', '')
+ .replace('.', '')
+ .replace(',', '');
+
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { title, ...priceRange };
+ });
+
+ const jsonData = JSON.stringify(data);
+ await writeFile('products.json', jsonData);
+
+ const parser = new AsyncParser();
+ const csvData = await parser.parse(data).promise();
+ await writeFile('products.csv', csvData);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
```
@@ -85,13 +87,13 @@ Let's introduce several functions to make the whole thing easier to digest. Firs
```js
async function download(url) {
- const response = await fetch(url);
- if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
- } else {
- throw new Error(`HTTP ${response.status}`);
- }
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
}
```
@@ -99,26 +101,21 @@ Next, we can put parsing into a `parseProduct()` function, which takes the produ
```js
function parseProduct($productItem) {
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text().trim().replace('$', '').replace('.', '').replace(',', '');
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
-
- return { title, ...priceRange };
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { title, ...priceRange };
}
```
@@ -126,7 +123,7 @@ Now the JSON export. For better readability, let's make a small change here and
```js
function exportJSON(data) {
- return JSON.stringify(data, null, 2);
+ return JSON.stringify(data, null, 2);
}
```
@@ -134,8 +131,8 @@ The last function we'll add will take care of the CSV export:
```js
async function exportCSV(data) {
- const parser = new AsyncParser();
- return await parser.parse(data).promise();
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
}
```
@@ -147,55 +144,52 @@ import { writeFile } from 'fs/promises';
import { AsyncParser } from '@json2csv/node';
async function download(url) {
- const response = await fetch(url);
- if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
- } else {
- throw new Error(`HTTP ${response.status}`);
- }
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
}
function parseProduct($productItem) {
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text().trim().replace('$', '').replace('.', '').replace(',', '');
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
-
- return { title, ...priceRange };
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { title, ...priceRange };
}
function exportJSON(data) {
- return JSON.stringify(data, null, 2);
+ return JSON.stringify(data, null, 2);
}
async function exportCSV(data) {
- const parser = new AsyncParser();
- return await parser.parse(data).promise();
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
}
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const data = $(".product-item").toArray().map(element => {
- const $productItem = $(element);
- const item = parseProduct($productItem);
- return item;
-});
+const data = $('.product-item')
+ .toArray()
+ .map((element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem);
+ return item;
+ });
await writeFile('products.json', exportJSON(data));
await writeFile('products.csv', await exportCSV(data));
@@ -240,6 +234,7 @@ function parseProduct($productItem) {
In the previous code example, we've also added the URL to the object returned by the function. If we run the scraper now, it should produce exports where each product contains a link to its product page:
+
```json title=products.json
[
{
@@ -283,20 +278,23 @@ function parseProduct($productItem, baseURL) {
Now we'll pass the base URL to the function in the main body of our program:
```js
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const data = $(".product-item").toArray().map(element => {
- const $productItem = $(element);
- // highlight-next-line
- const item = parseProduct($productItem, listingURL);
- return item;
-});
+const data = $('.product-item')
+ .toArray()
+ .map((element) => {
+ const $productItem = $(element);
+ // highlight-next-line
+ const item = parseProduct($productItem, listingURL);
+ return item;
+ });
```
When we run the scraper now, we should see full URLs in our exports:
+
```json title=products.json
[
{
@@ -342,26 +340,27 @@ https://en.wikipedia.org/wiki/Botswana
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const listingURL = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
- const response = await fetch(listingURL);
+const listingURL =
+ 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa';
+const response = await fetch(listingURL);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $(".wikitable tr td:nth-child(3)").toArray()) {
- const nameCell = $(element);
- const link = nameCell.find("a").first();
- const url = new URL(link.attr("href"), listingURL).href;
- console.log(url);
+ for (const element of $('.wikitable tr td:nth-child(3)').toArray()) {
+ const nameCell = $(element);
+ const link = nameCell.find('a').first();
+ const url = new URL(link.attr('href'), listingURL).href;
+ console.log(url);
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
@@ -386,31 +385,31 @@ https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-u
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- const listingURL = "https://www.theguardian.com/sport/formulaone";
- const response = await fetch(listingURL);
+const listingURL = 'https://www.theguardian.com/sport/formulaone';
+const response = await fetch(listingURL);
- if (response.ok) {
+if (response.ok) {
const html = await response.text();
const $ = cheerio.load(html);
- for (const element of $("#maincontent ul li").toArray()) {
- const link = $(element).find("a").first();
- const url = new URL(link.attr("href"), listingURL).href;
- console.log(url);
+ for (const element of $('#maincontent ul li').toArray()) {
+ const link = $(element).find('a').first();
+ const url = new URL(link.attr('href'), listingURL).href;
+ console.log(url);
}
- } else {
+} else {
throw new Error(`HTTP ${response.status}`);
- }
- ```
+}
+```
- Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this:
+Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this:
- ```text
- https://www.theguardian.com/sport/article/2024/sep/02/example
- https://www.theguardian.com/sport/article/2024/sep/02/example#comments
- ```
+```text
+https://www.theguardian.com/sport/article/2024/sep/02/example
+https://www.theguardian.com/sport/article/2024/sep/02/example#comments
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/10_crawling.md b/sources/academy/webscraping/scraping_basics_javascript/10_crawling.md
index 926fd6d839..cfe222f90c 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/10_crawling.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/crawling
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
@@ -24,57 +24,54 @@ import { writeFile } from 'fs/promises';
import { AsyncParser } from '@json2csv/node';
async function download(url) {
- const response = await fetch(url);
- if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
- } else {
- throw new Error(`HTTP ${response.status}`);
- }
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
}
function parseProduct($productItem, baseURL) {
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
- const url = new URL($title.attr("href"), baseURL).href;
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
-
- return { url, title, ...priceRange };
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+ const url = new URL($title.attr('href'), baseURL).href;
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text().trim().replace('$', '').replace('.', '').replace(',', '');
+
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { url, title, ...priceRange };
}
function exportJSON(data) {
- return JSON.stringify(data, null, 2);
+ return JSON.stringify(data, null, 2);
}
async function exportCSV(data) {
- const parser = new AsyncParser();
- return await parser.parse(data).promise();
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
}
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const data = $(".product-item").toArray().map(element => {
- const $productItem = $(element);
- // highlight-next-line
- const item = parseProduct($productItem, listingURL);
- return item;
-});
+const data = $('.product-item')
+ .toArray()
+ .map((element) => {
+ const $productItem = $(element);
+ // highlight-next-line
+ const item = parseProduct($productItem, listingURL);
+ return item;
+ });
await writeFile('products.json', exportJSON(data));
await writeFile('products.csv', await exportCSV(data));
@@ -90,40 +87,34 @@ Depending on what's valuable for our use case, we can now use the same technique
```html
```
It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string:
```js
-const vendor = $(".product-meta__vendor").text().trim();
+const vendor = $('.product-meta__vendor').text().trim();
```
But where do we put this line in our program?
@@ -135,15 +126,17 @@ In the `.map()` loop, we're already going through all the products. Let's expand
First, we need to make the loop asynchronous so that we can use `await download()` for each product. We'll add the `async` keyword to the inner function and rename the collection to `promises`, since it will now store promises that resolve to items rather than the items themselves. We'll pass it to `await Promise.all()` to resolve all the promises and retrieve the actual items.
```js
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
// highlight-next-line
-const promises = $(".product-item").toArray().map(async element => {
- const $productItem = $(element);
- const item = parseProduct($productItem, listingURL);
- return item;
-});
+const promises = $('.product-item')
+ .toArray()
+ .map(async (element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+ return item;
+ });
// highlight-next-line
const data = await Promise.all(promises);
```
@@ -151,20 +144,22 @@ const data = await Promise.all(promises);
The program behaves the same as before, but now the code is prepared to make HTTP requests from within the inner function. Let's do it:
```js
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const promises = $(".product-item").toArray().map(async element => {
- const $productItem = $(element);
- const item = parseProduct($productItem, listingURL);
+const promises = $('.product-item')
+ .toArray()
+ .map(async (element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
- // highlight-next-line
- const $p = await download(item.url);
- // highlight-next-line
- item.vendor = $p(".product-meta__vendor").text().trim();
+ // highlight-next-line
+ const $p = await download(item.url);
+ // highlight-next-line
+ item.vendor = $p('.product-meta__vendor').text().trim();
- return item;
-});
+ return item;
+ });
const data = await Promise.all(promises);
```
@@ -173,6 +168,7 @@ We download each product detail page and parse its HTML using Cheerio. The `$p`
If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
+
```json title=products.json
[
{
@@ -237,43 +233,39 @@ Locating cells in tables is sometimes easier if you know how to [filter](https:/
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- async function download(url) {
+async function download(url) {
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
+ const html = await response.text();
+ return cheerio.load(html);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
- }
+}
- const listingURL = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa";
- const $ = await download(listingURL);
+const listingURL =
+ 'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa';
+const $ = await download(listingURL);
- const $cells = $(".wikitable tr td:nth-child(3)");
- const promises = $cells.toArray().map(async element => {
+const $cells = $('.wikitable tr td:nth-child(3)');
+const promises = $cells.toArray().map(async (element) => {
const $nameCell = $(element);
- const $link = $nameCell.find("a").first();
- const countryURL = new URL($link.attr("href"), listingURL).href;
+ const $link = $nameCell.find('a').first();
+ const countryURL = new URL($link.attr('href'), listingURL).href;
const $c = await download(countryURL);
- const $label = $c("th.infobox-label")
- .filter((i, element) => $c(element).text().trim() == "Calling code")
- .first();
- const callingCode = $label
- .parent()
- .find("td.infobox-data")
- .first()
- .text()
- .trim();
+ const $label = $c('th.infobox-label')
+ .filter((i, element) => $c(element).text().trim() == 'Calling code')
+ .first();
+ const callingCode = $label.parent().find('td.infobox-data').first().text().trim();
console.log(`${countryURL} ${callingCode || null}`);
- });
- await Promise.all(promises);
- ```
+});
+await Promise.all(promises);
+```
@@ -306,36 +298,38 @@ PA Media: Lewis Hamilton reveals lifelong battle with depression after school bu
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- async function download(url) {
+async function download(url) {
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
+ const html = await response.text();
+ return cheerio.load(html);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
- }
+}
- const listingURL = "https://www.theguardian.com/sport/formulaone";
- const $ = await download(listingURL);
+const listingURL = 'https://www.theguardian.com/sport/formulaone';
+const $ = await download(listingURL);
- const promises = $("#maincontent ul li").toArray().map(async element => {
- const $item = $(element);
- const $link = $item.find("a").first();
- const authorURL = new URL($link.attr("href"), listingURL).href;
+const promises = $('#maincontent ul li')
+ .toArray()
+ .map(async (element) => {
+ const $item = $(element);
+ const $link = $item.find('a').first();
+ const authorURL = new URL($link.attr('href'), listingURL).href;
- const $a = await download(authorURL);
- const title = $a("h1").text().trim();
+ const $a = await download(authorURL);
+ const title = $a('h1').text().trim();
- const author = $a('a[rel="author"]').text().trim();
- const address = $a('aside address').text().trim();
+ const author = $a('a[rel="author"]').text().trim();
+ const address = $a('aside address').text().trim();
- console.log(`${author || address || null}: ${title}`);
- });
- await Promise.all(promises);
- ```
+ console.log(`${author || address || null}: ${title}`);
+ });
+await Promise.all(promises);
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_javascript/11_scraping_variants.md
index 2d4044240a..2ce677e301 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/11_scraping_variants.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/scraping-variants
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
@@ -22,20 +22,43 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G
```html
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
```
@@ -49,21 +72,23 @@ After a bit of detective work, we notice that not far below the `block-swatch-li
```html
-
-
-
-
+
+
+
+
```
@@ -74,27 +99,29 @@ These elements aren't visible to regular visitors. They're there just in case br
Using our knowledge of Cheerio, we can locate the `option` elements and extract the data we need. We'll loop over the options, extract variant names, and create a corresponding array of items for each product:
```js
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const promises = $(".product-item").toArray().map(async element => {
- const $productItem = $(element);
- const item = parseProduct($productItem, listingURL);
-
- const $p = await download(item.url);
- item.vendor = $p(".product-meta__vendor").text().trim();
-
- // highlight-start
- const $options = $p(".product-form__option.no-js option");
- const items = $options.toArray().map(optionElement => {
- const $option = $(optionElement);
- const variantName = $option.text().trim();
- return { variantName, ...item };
- });
- // highlight-end
-
- return item;
-});
+const promises = $('.product-item')
+ .toArray()
+ .map(async (element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+
+ const $p = await download(item.url);
+ item.vendor = $p('.product-meta__vendor').text().trim();
+
+ // highlight-start
+ const $options = $p('.product-form__option.no-js option');
+ const items = $options.toArray().map((optionElement) => {
+ const $option = $(optionElement);
+ const variantName = $option.text().trim();
+ return { variantName, ...item };
+ });
+ // highlight-end
+
+ return item;
+ });
const data = await Promise.all(promises);
```
@@ -105,25 +132,27 @@ We loop over the variants using `.map()` method to create an array of item copie
Let's adjust the loop so it returns a promise that resolves to an array of items instead of a single item. If a product has no variants, we'll return an array with a single item, setting `variantName` to `null`:
```js
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const promises = $(".product-item").toArray().map(async element => {
- const $productItem = $(element);
- const item = parseProduct($productItem, listingURL);
-
- const $p = await download(item.url);
- item.vendor = $p(".product-meta__vendor").text().trim();
-
- const $options = $p(".product-form__option.no-js option");
- const items = $options.toArray().map(optionElement => {
- const $option = $(optionElement);
- const variantName = $option.text().trim();
- return { variantName, ...item };
- });
- // highlight-next-line
- return items.length > 0 ? items : [{ variantName: null, ...item }];
-});
+const promises = $('.product-item')
+ .toArray()
+ .map(async (element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+
+ const $p = await download(item.url);
+ item.vendor = $p('.product-meta__vendor').text().trim();
+
+ const $options = $p('.product-form__option.no-js option');
+ const items = $options.toArray().map((optionElement) => {
+ const $option = $(optionElement);
+ const variantName = $option.text().trim();
+ return { variantName, ...item };
+ });
+ // highlight-next-line
+ return items.length > 0 ? items : [{ variantName: null, ...item }];
+ });
// highlight-start
const itemLists = await Promise.all(promises);
const data = itemLists.flat();
@@ -135,6 +164,7 @@ After modifying the loop, we also updated how we collect the items into the `dat
If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page.
+
```json title=products.json
[
...
@@ -153,6 +183,7 @@ If we run the program now, we'll see 34 items in total. Some items don't have va
Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items.
+
```json title=products.json
[
...
@@ -179,6 +210,7 @@ Some products will break into several items, each with a different variant name.
Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like _$74.95_, instead of _from $74.95_.
+
```json title=products.json
[
...
@@ -200,17 +232,9 @@ The items now contain the variant as text, which is good for a start, but we wan
```js
function parseVariant($option) {
- const [variantName, priceText] = $option
- .text()
- .trim()
- .split(" - ");
- const price = parseInt(
- priceText
- .replace("$", "")
- .replace(".", "")
- .replace(",", "")
- );
- return { variantName, price };
+ const [variantName, priceText] = $option.text().trim().split(' - ');
+ const price = parseInt(priceText.replace('$', '').replace('.', '').replace(',', ''));
+ return { variantName, price };
}
```
@@ -226,83 +250,72 @@ import { writeFile } from 'fs/promises';
import { AsyncParser } from '@json2csv/node';
async function download(url) {
- const response = await fetch(url);
- if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
- } else {
- throw new Error(`HTTP ${response.status}`);
- }
+ const response = await fetch(url);
+ if (response.ok) {
+ const html = await response.text();
+ return cheerio.load(html);
+ } else {
+ throw new Error(`HTTP ${response.status}`);
+ }
}
function parseProduct($productItem, baseURL) {
- const $title = $productItem.find(".product-item__title");
- const title = $title.text().trim();
- const url = new URL($title.attr("href"), baseURL).href;
-
- const $price = $productItem.find(".price").contents().last();
- const priceRange = { minPrice: null, price: null };
- const priceText = $price
- .text()
- .trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
-
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
- } else {
- priceRange.minPrice = parseInt(priceText);
- priceRange.price = priceRange.minPrice;
- }
-
- return { url, title, ...priceRange };
+ const $title = $productItem.find('.product-item__title');
+ const title = $title.text().trim();
+ const url = new URL($title.attr('href'), baseURL).href;
+
+ const $price = $productItem.find('.price').contents().last();
+ const priceRange = { minPrice: null, price: null };
+ const priceText = $price.text().trim().replace('$', '').replace('.', '').replace(',', '');
+
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
+ } else {
+ priceRange.minPrice = parseInt(priceText);
+ priceRange.price = priceRange.minPrice;
+ }
+
+ return { url, title, ...priceRange };
}
async function exportJSON(data) {
- return JSON.stringify(data, null, 2);
+ return JSON.stringify(data, null, 2);
}
async function exportCSV(data) {
- const parser = new AsyncParser();
- return await parser.parse(data).promise();
+ const parser = new AsyncParser();
+ return await parser.parse(data).promise();
}
// highlight-start
function parseVariant($option) {
- const [variantName, priceText] = $option
- .text()
- .trim()
- .split(" - ");
- const price = parseInt(
- priceText
- .replace("$", "")
- .replace(".", "")
- .replace(",", "")
- );
- return { variantName, price };
+ const [variantName, priceText] = $option.text().trim().split(' - ');
+ const price = parseInt(priceText.replace('$', '').replace('.', '').replace(',', ''));
+ return { variantName, price };
}
// highlight-end
-const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales";
+const listingURL = 'https://warehouse-theme-metal.myshopify.com/collections/sales';
const $ = await download(listingURL);
-const promises = $(".product-item").toArray().map(async element => {
- const $productItem = $(element);
- const item = parseProduct($productItem, listingURL);
-
- const $p = await download(item.url);
- item.vendor = $p(".product-meta__vendor").text().trim();
-
- const $options = $p(".product-form__option.no-js option");
- const items = $options.toArray().map(optionElement => {
- // highlight-next-line
- const variant = parseVariant($(optionElement));
- // highlight-next-line
- return { ...item, ...variant };
- });
- return items.length > 0 ? items : [{ variantName: null, ...item }];
-});
+const promises = $('.product-item')
+ .toArray()
+ .map(async (element) => {
+ const $productItem = $(element);
+ const item = parseProduct($productItem, listingURL);
+
+ const $p = await download(item.url);
+ item.vendor = $p('.product-meta__vendor').text().trim();
+
+ const $options = $p('.product-form__option.no-js option');
+ const items = $options.toArray().map((optionElement) => {
+ // highlight-next-line
+ const variant = parseVariant($(optionElement));
+ // highlight-next-line
+ return { ...item, ...variant };
+ });
+ return items.length > 0 ? items : [{ variantName: null, ...item }];
+ });
const itemLists = await Promise.all(promises);
const data = itemLists.flat();
@@ -313,6 +326,7 @@ await writeFile('products.csv', await exportCSV(data));
Let's run the scraper and see if all the items in the data contain prices:
+
```json title=products.json
[
...
@@ -384,67 +398,58 @@ Your output should look something like this:
Solution
- After inspecting the registry, you'll notice that packages with the keyword "LLM" have a dedicated URL. Also, changing the sorting dropdown results in a page with its own URL. We'll use that as our starting point, which saves us from having to scrape the whole registry and then filter by keyword or sort by the number of dependents.
+After inspecting the registry, you'll notice that packages with the keyword "LLM" have a dedicated URL. Also, changing the sorting dropdown results in a page with its own URL. We'll use that as our starting point, which saves us from having to scrape the whole registry and then filter by keyword or sort by the number of dependents.
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- async function download(url) {
+async function download(url) {
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
+ const html = await response.text();
+ return cheerio.load(html);
} else {
- throw new Error(`HTTP ${response.status}`);
- }
- }
-
- const listingURL = "https://www.npmjs.com/search?page=0&q=keywords%3Allm&sortBy=dependent_count";
- const $ = await download(listingURL);
-
- const promises = $("section").toArray().map(async element => {
- const $card = $(element);
-
- const details = $card
- .children()
- .first()
- .children()
- .last()
- .text()
- .split("•");
- const updatedText = details[2].trim();
- const dependents = parseInt(details[3].replace("dependents", "").trim());
-
- if (updatedText.includes("years ago")) {
- const yearsAgo = parseInt(updatedText.replace("years ago", "").trim());
- if (yearsAgo > 2) {
- return null;
- }
+ throw new Error(`HTTP ${response.status}`);
}
+}
+
+const listingURL = 'https://www.npmjs.com/search?page=0&q=keywords%3Allm&sortBy=dependent_count';
+const $ = await download(listingURL);
- const $link = $card.find("a").first();
- const name = $link.text().trim();
- const url = new URL($link.attr("href"), listingURL).href;
- const description = $card.find("p").text().trim();
+const promises = $('section')
+ .toArray()
+ .map(async (element) => {
+ const $card = $(element);
- const downloadsText = $card
- .children()
- .last()
- .text()
- .replace(",", "")
- .trim();
- const downloads = parseInt(downloadsText);
+ const details = $card.children().first().children().last().text().split('•');
+ const updatedText = details[2].trim();
+ const dependents = parseInt(details[3].replace('dependents', '').trim());
- return { name, url, description, dependents, downloads };
- });
+ if (updatedText.includes('years ago')) {
+ const yearsAgo = parseInt(updatedText.replace('years ago', '').trim());
+ if (yearsAgo > 2) {
+ return null;
+ }
+ }
- const data = await Promise.all(promises);
- console.log(data.filter(item => item !== null).splice(0, 5));
- ```
+ const $link = $card.find('a').first();
+ const name = $link.text().trim();
+ const url = new URL($link.attr('href'), listingURL).href;
+ const description = $card.find('p').text().trim();
- Since the HTML doesn't contain any descriptive classes, we must rely on its structure. We're using [`.children()`](https://cheerio.js.org/docs/api/classes/Cheerio#children) to carefully navigate the HTML element tree.
+    const downloadsText = $card.children().last().text().replace(/,/g, '').trim();
+ const downloads = parseInt(downloadsText);
+
+ return { name, url, description, dependents, downloads };
+ });
+
+const data = await Promise.all(promises);
+console.log(data.filter((item) => item !== null).splice(0, 5));
+```
- For items older than 2 years, we return `null` instead of an item. Before printing the results, we use [.filter()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter) to remove these empty values and [.splice()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/splice) the array down to just 5 items.
+Since the HTML doesn't contain any descriptive classes, we must rely on its structure. We're using [`.children()`](https://cheerio.js.org/docs/api/classes/Cheerio#children) to carefully navigate the HTML element tree.
+
+For items older than 2 years, we return `null` instead of an item. Before printing the results, we use [.filter()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter) to remove these empty values and [.splice()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/splice) the array down to just 5 items.
@@ -463,38 +468,40 @@ At the time of writing, the shortest article on the CNN Sports homepage is [abou
Solution
- ```js
- import * as cheerio from 'cheerio';
+```js
+import * as cheerio from 'cheerio';
- async function download(url) {
+async function download(url) {
const response = await fetch(url);
if (response.ok) {
- const html = await response.text();
- return cheerio.load(html);
+ const html = await response.text();
+ return cheerio.load(html);
} else {
- throw new Error(`HTTP ${response.status}`);
+ throw new Error(`HTTP ${response.status}`);
}
- }
+}
- const listingURL = "https://edition.cnn.com/sport";
- const $ = await download(listingURL);
+const listingURL = 'https://edition.cnn.com/sport';
+const $ = await download(listingURL);
- const promises = $(".layout__main .card").toArray().map(async element => {
- const $link = $(element).find("a").first();
- const articleURL = new URL($link.attr("href"), listingURL).href;
+const promises = $('.layout__main .card')
+ .toArray()
+ .map(async (element) => {
+ const $link = $(element).find('a').first();
+ const articleURL = new URL($link.attr('href'), listingURL).href;
- const $a = await download(articleURL);
- const content = $a(".article__content").text().trim();
+ const $a = await download(articleURL);
+ const content = $a('.article__content').text().trim();
- return { url: articleURL, length: content.length };
- });
+ return { url: articleURL, length: content.length };
+ });
- const data = await Promise.all(promises);
- const nonZeroData = data.filter(({ url, length }) => length > 0);
- nonZeroData.sort((a, b) => a.length - b.length);
- const shortestItem = nonZeroData[0];
+const data = await Promise.all(promises);
+const nonZeroData = data.filter(({ url, length }) => length > 0);
+nonZeroData.sort((a, b) => a.length - b.length);
+const shortestItem = nonZeroData[0];
- console.log(shortestItem.url);
- ```
+console.log(shortestItem.url);
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/12_framework.md b/sources/academy/webscraping/scraping_basics_javascript/12_framework.md
index e4d45aef47..ce42c6231e 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/12_framework.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/framework
---
import LegacyJsCourseAdmonition from '@site/src/components/LegacyJsCourseAdmonition';
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
@@ -85,11 +85,14 @@ import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
// highlight-start
async requestHandler({ $, log, request, enqueueLinks }) {
- if (request.label === 'DETAIL') {
- log.info(request.url);
- } else {
- await enqueueLinks({ label: 'DETAIL', selector: '.product-list a.product-item__title' });
- }
+ if (request.label === 'DETAIL') {
+ log.info(request.url);
+ } else {
+ await enqueueLinks({
+ label: 'DETAIL',
+ selector: '.product-list a.product-item__title',
+ });
+ }
},
// highlight-end
});
@@ -126,9 +129,12 @@ const crawler = new CheerioCrawler({
title: $('.product-meta__title').text().trim(),
vendor: $('.product-meta__vendor').text().trim(),
};
- log.info("Item scraped", item);
+ log.info('Item scraped', item);
} else {
- await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
+ await enqueueLinks({
+ selector: '.product-list a.product-item__title',
+ label: 'DETAIL',
+ });
}
},
});
@@ -143,17 +149,17 @@ const crawler = new CheerioCrawler({
async requestHandler({ $, request, enqueueLinks, log }) {
if (request.label === 'DETAIL') {
// highlight-next-line
- const $price = $(".product-form__info-content .price").contents().last();
+ const $price = $('.product-form__info-content .price').contents().last();
const priceRange = { minPrice: null, price: null };
const priceText = $price
.text()
.trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
+ .replace('$', '')
+ .replace('.', '')
+ .replace(',', '');
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
} else {
priceRange.minPrice = parseInt(priceText);
priceRange.price = priceRange.minPrice;
@@ -161,13 +167,16 @@ const crawler = new CheerioCrawler({
const item = {
url: request.url,
- title: $(".product-meta__title").text().trim(),
+ title: $('.product-meta__title').text().trim(),
vendor: $('.product-meta__vendor').text().trim(),
...priceRange,
};
- log.info("Item scraped", item);
+ log.info('Item scraped', item);
} else {
- await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
+ await enqueueLinks({
+ selector: '.product-list a.product-item__title',
+ label: 'DETAIL',
+ });
}
},
});
@@ -179,33 +188,25 @@ Finally, the variants. We can reuse the `parseVariant()` function as-is. In the
import { CheerioCrawler } from 'crawlee';
function parseVariant($option) {
- const [variantName, priceText] = $option
- .text()
- .trim()
- .split(" - ");
- const price = parseInt(
- priceText
- .replace("$", "")
- .replace(".", "")
- .replace(",", "")
- );
- return { variantName, price };
+ const [variantName, priceText] = $option.text().trim().split(' - ');
+ const price = parseInt(priceText.replace('$', '').replace('.', '').replace(',', ''));
+ return { variantName, price };
}
const crawler = new CheerioCrawler({
async requestHandler({ $, request, enqueueLinks, log }) {
if (request.label === 'DETAIL') {
- const $price = $(".product-form__info-content .price").contents().last();
+ const $price = $('.product-form__info-content .price').contents().last();
const priceRange = { minPrice: null, price: null };
const priceText = $price
.text()
.trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
+ .replace('$', '')
+ .replace('.', '')
+ .replace(',', '');
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
} else {
priceRange.minPrice = parseInt(priceText);
priceRange.price = priceRange.minPrice;
@@ -213,7 +214,7 @@ const crawler = new CheerioCrawler({
const item = {
url: request.url,
- title: $(".product-meta__title").text().trim(),
+ title: $('.product-meta__title').text().trim(),
vendor: $('.product-meta__vendor').text().trim(),
...priceRange,
// highlight-next-line
@@ -221,18 +222,21 @@ const crawler = new CheerioCrawler({
};
// highlight-start
- const $variants = $(".product-form__option.no-js option");
+ const $variants = $('.product-form__option.no-js option');
if ($variants.length === 0) {
- log.info("Item scraped", item);
+ log.info('Item scraped', item);
} else {
- for (const element of $variants.toArray()) {
- const variant = parseVariant($(element));
- log.info("Item scraped", { ...item, ...variant });
- }
+ for (const element of $variants.toArray()) {
+ const variant = parseVariant($(element));
+ log.info('Item scraped', { ...item, ...variant });
+ }
}
// highlight-end
} else {
- await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
+ await enqueueLinks({
+ selector: '.product-list a.product-item__title',
+ label: 'DETAIL',
+ });
}
},
});
@@ -295,17 +299,9 @@ Crawlee gives us stats about HTTP requests and concurrency, but once we started
import { CheerioCrawler } from 'crawlee';
function parseVariant($option) {
- const [variantName, priceText] = $option
- .text()
- .trim()
- .split(" - ");
- const price = parseInt(
- priceText
- .replace("$", "")
- .replace(".", "")
- .replace(",", "")
- );
- return { variantName, price };
+ const [variantName, priceText] = $option.text().trim().split(' - ');
+ const price = parseInt(priceText.replace('$', '').replace('.', '').replace(',', ''));
+ return { variantName, price };
}
const crawler = new CheerioCrawler({
@@ -314,17 +310,17 @@ const crawler = new CheerioCrawler({
// highlight-next-line
log.info(`Product detail page: ${request.url}`);
- const $price = $(".product-form__info-content .price").contents().last();
+ const $price = $('.product-form__info-content .price').contents().last();
const priceRange = { minPrice: null, price: null };
const priceText = $price
.text()
.trim()
- .replace("$", "")
- .replace(".", "")
- .replace(",", "");
+ .replace('$', '')
+ .replace('.', '')
+ .replace(',', '');
- if (priceText.startsWith("From ")) {
- priceRange.minPrice = parseInt(priceText.replace("From ", ""));
+ if (priceText.startsWith('From ')) {
+ priceRange.minPrice = parseInt(priceText.replace('From ', ''));
} else {
priceRange.minPrice = parseInt(priceText);
priceRange.price = priceRange.minPrice;
@@ -332,29 +328,32 @@ const crawler = new CheerioCrawler({
const item = {
url: request.url,
- title: $(".product-meta__title").text().trim(),
+ title: $('.product-meta__title').text().trim(),
vendor: $('.product-meta__vendor').text().trim(),
...priceRange,
variantName: null,
};
- const $variants = $(".product-form__option.no-js option");
+ const $variants = $('.product-form__option.no-js option');
if ($variants.length === 0) {
- // highlight-next-line
- log.info('Saving a product');
- pushData(item);
- } else {
- for (const element of $variants.toArray()) {
- const variant = parseVariant($(element));
// highlight-next-line
- log.info('Saving a product variant');
- pushData({ ...item, ...variant });
- }
+ log.info('Saving a product');
+ pushData(item);
+ } else {
+ for (const element of $variants.toArray()) {
+ const variant = parseVariant($(element));
+ // highlight-next-line
+ log.info('Saving a product variant');
+ pushData({ ...item, ...variant });
+ }
}
} else {
// highlight-next-line
log.info('Looking for product detail pages');
- await enqueueLinks({ selector: '.product-list a.product-item__title', label: 'DETAIL' });
+ await enqueueLinks({
+ selector: '.product-list a.product-item__title',
+ label: 'DETAIL',
+ });
}
},
});
@@ -390,6 +389,7 @@ Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Acade
If you export the dataset as JSON, it should look something like this:
+
```json
[
{
@@ -422,42 +422,42 @@ If you export the dataset as JSON, it should look something like this:
Solution
- ```js
- import { CheerioCrawler } from 'crawlee';
-
- const crawler = new CheerioCrawler({
- async requestHandler({ $, request, enqueueLinks, pushData }) {
- if (request.label === 'DRIVER') {
- const info = {};
- for (const itemElement of $('.common-driver-info li').toArray()) {
- const name = $(itemElement).find('span').text().trim();
- const value = $(itemElement).find('h4').text().trim();
- info[name] = value;
- }
- const detail = {};
- for (const linkElement of $('.driver-detail--cta-group a').toArray()) {
- const name = $(linkElement).find('p').text().trim();
- const value = $(linkElement).find('h2').text().trim();
- detail[name] = value;
- });
- const [dobDay, dobMonth, dobYear] = info['DOB'].split("/");
- pushData({
- url: request.url,
- name: $('h1').text().trim(),
- team: detail['Team'],
- nationality: info['Nationality'],
- dob: `${dobYear}-${dobMonth}-${dobDay}`,
- instagram_url: $(".common-social-share a[href*='instagram']").attr('href'),
- });
- } else {
- await enqueueLinks({ selector: '.teams-driver-item a', label: 'DRIVER' });
+```js
+import { CheerioCrawler } from 'crawlee';
+
+const crawler = new CheerioCrawler({
+ async requestHandler({ $, request, enqueueLinks, pushData }) {
+ if (request.label === 'DRIVER') {
+ const info = {};
+ for (const itemElement of $('.common-driver-info li').toArray()) {
+ const name = $(itemElement).find('span').text().trim();
+ const value = $(itemElement).find('h4').text().trim();
+ info[name] = value;
}
- },
- });
+ const detail = {};
+ for (const linkElement of $('.driver-detail--cta-group a').toArray()) {
+ const name = $(linkElement).find('p').text().trim();
+ const value = $(linkElement).find('h2').text().trim();
+ detail[name] = value;
+      }
+      const [dobDay, dobMonth, dobYear] = info['DOB'].split('/');
+ pushData({
+ url: request.url,
+ name: $('h1').text().trim(),
+ team: detail['Team'],
+ nationality: info['Nationality'],
+ dob: `${dobYear}-${dobMonth}-${dobDay}`,
+ instagram_url: $(".common-social-share a[href*='instagram']").attr('href'),
+ });
+ } else {
+ await enqueueLinks({ selector: '.teams-driver-item a', label: 'DRIVER' });
+ }
+ },
+});
- await crawler.run(['https://www.f1academy.com/Racing-Series/Drivers']);
- await crawler.exportData('dataset.json');
- ```
+await crawler.run(['https://www.f1academy.com/Racing-Series/Drivers']);
+await crawler.exportData('dataset.json');
+```
@@ -472,6 +472,7 @@ The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listin
If you export the dataset as JSON, it should look something like this:
+
```json
[
{
@@ -516,40 +517,49 @@ When navigating to the first IMDb search result, you might find it helpful to kn
Solution
- ```js
- import { CheerioCrawler, Request } from 'crawlee';
- import { escape } from 'node:querystring';
+```js
+import { CheerioCrawler, Request } from 'crawlee';
+import { escape } from 'node:querystring';
- const crawler = new CheerioCrawler({
+const crawler = new CheerioCrawler({
async requestHandler({ $, request, enqueueLinks, pushData, addRequests }) {
- if (request.label === 'IMDB') {
- // handle IMDB film page
- pushData({
- url: request.url,
- title: $('h1').text().trim(),
- rating: $("[data-testid='hero-rating-bar__aggregate-rating__score']").first().text().trim(),
- });
- } else if (request.label === 'IMDB_SEARCH') {
- // handle IMDB search results
- await enqueueLinks({ selector: '.find-result-item a', label: 'IMDB', limit: 1 });
-
- } else if (request.label === 'NETFLIX') {
- // handle Netflix table
- const $buttons = $('[data-uia="top10-table-row-title"] button');
- const requests = $buttons.toArray().map(buttonElement => {
- const name = $(buttonElement).text().trim();
- const imdbSearchUrl = `https://www.imdb.com/find/?q=${escape(name)}&s=tt&ttype=ft`;
- return new Request({ url: imdbSearchUrl, label: 'IMDB_SEARCH' });
- });
- await addRequests($requests.get());
- } else {
- throw new Error(`Unexpected request label: ${request.label}`);
- }
+ if (request.label === 'IMDB') {
+ // handle IMDB film page
+ pushData({
+ url: request.url,
+ title: $('h1').text().trim(),
+ rating: $("[data-testid='hero-rating-bar__aggregate-rating__score']")
+ .first()
+ .text()
+ .trim(),
+ });
+ } else if (request.label === 'IMDB_SEARCH') {
+ // handle IMDB search results
+ await enqueueLinks({
+ selector: '.find-result-item a',
+ label: 'IMDB',
+ limit: 1,
+ });
+ } else if (request.label === 'NETFLIX') {
+ // handle Netflix table
+ const $buttons = $('[data-uia="top10-table-row-title"] button');
+ const requests = $buttons.toArray().map((buttonElement) => {
+ const name = $(buttonElement).text().trim();
+ const imdbSearchUrl = `https://www.imdb.com/find/?q=${escape(name)}&s=tt&ttype=ft`;
+ return new Request({
+ url: imdbSearchUrl,
+ label: 'IMDB_SEARCH',
+ });
+ });
+      await addRequests(requests);
+ } else {
+ throw new Error(`Unexpected request label: ${request.label}`);
+ }
},
- });
+});
- await crawler.run(['https://www.netflix.com/tudum/top10']);
- await crawler.exportData('dataset.json');
- ```
+await crawler.run(['https://www.netflix.com/tudum/top10']);
+await crawler.exportData('dataset.json');
+```
diff --git a/sources/academy/webscraping/scraping_basics_javascript/13_platform.md b/sources/academy/webscraping/scraping_basics_javascript/13_platform.md
index e0abb0bb02..c7dc0c9779 100644
--- a/sources/academy/webscraping/scraping_basics_javascript/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_javascript/13_platform.md
@@ -203,25 +203,25 @@ Proxy configuration is a type of [Actor input](https://docs.apify.com/platform/a
```json title=".actor/inputSchema.json"
{
- "title": "Crawlee Cheerio Scraper",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "proxyConfig": {
- "title": "Proxy config",
- "description": "Proxy configuration",
- "type": "object",
- "editor": "proxy",
- "prefill": {
- "useApifyProxy": true,
- "apifyProxyGroups": []
- },
- "default": {
- "useApifyProxy": true,
- "apifyProxyGroups": []
- }
+ "title": "Crawlee Cheerio Scraper",
+ "type": "object",
+ "schemaVersion": 1,
+ "properties": {
+ "proxyConfig": {
+ "title": "Proxy config",
+ "description": "Proxy configuration",
+ "type": "object",
+ "editor": "proxy",
+ "prefill": {
+ "useApifyProxy": true,
+ "apifyProxyGroups": []
+ },
+ "default": {
+ "useApifyProxy": true,
+ "apifyProxyGroups": []
+ }
+ }
}
- }
}
```
@@ -229,13 +229,13 @@ Now let's connect this file to the actor configuration. In `actor.json`, we'll a
```json title=".actor/actor.json"
{
- "actorSpecification": 1,
- "name": "warehouse-watchdog",
- "version": "0.0",
- "buildTag": "latest",
- "environmentVariables": {},
- // highlight-next-line
- "input": "./inputSchema.json"
+ "actorSpecification": 1,
+ "name": "warehouse-watchdog",
+ "version": "0.0",
+ "buildTag": "latest",
+ "environmentVariables": {},
+ // highlight-next-line
+ "input": "./inputSchema.json"
}
```
diff --git a/sources/academy/webscraping/scraping_basics_legacy/best_practices.md b/sources/academy/webscraping/scraping_basics_legacy/best_practices.md
index 04622d3d69..a1b6d86025 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/best_practices.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/best_practices.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/best-practices
noindex: true
---
-import LegacyAdmonition from '../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../scraping_basics/\_legacy.mdx';
**Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code.**
@@ -14,7 +14,7 @@ import LegacyAdmonition from '../scraping_basics/_legacy.mdx';
---
-Every developer has their own style, which evolves as they grow and learn. While one dev might prefer a more [functional](https://en.wikipedia.org/wiki/Functional_programming) style, another might find an [imperative](https://en.wikipedia.org/wiki/Imperative_programming) approach to be more intuitive. We at Apify understand this, and have written this best practices lesson with that in mind.
+Every developer has their own style, which evolves as they grow and learn. While one dev might prefer a more [functional](https://en.wikipedia.org/wiki/Functional_programming) style, another might find an [imperative](https://en.wikipedia.org/wiki/Imperative_programming) approach to be more intuitive. We at Apify understand this, and have written this best practices lesson with that in mind.
The goal of this lesson is not to force you into a specific paradigm or to make you think that you're doing things wrong, but instead to provide you some insight into the standards and best practices that we at Apify follow to ensure readable, maintainable, scalable code.
@@ -40,7 +40,7 @@ If you're writing your scraper in JavaScript, use [ES6](https://www.w3schools.co
### No magic numbers {#no-magic-numbers}
-Avoid using [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming)) as much as possible. Either declare them as a **constant** variable in your **constants.js** file, or if they are only used once, add a comment explaining what the number is.
+Avoid using [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming)) as much as possible. Either declare them as a **constant** variable in your **constants.js** file, or if they are only used once, add a comment explaining what the number is.
Don't write code like this:
@@ -75,7 +75,7 @@ Here is an example of an "incorrect" log message:
300 https://example.com/1234 1234
```
-And here is that log message translated into something that makes much more sense to the end user:
+And here is that log message translated into something that makes much more sense to the end user:
```text
Index 1234 --- https://example.com/1234 --- took 300 ms
diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md
index 8410611660..1501d319ab 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/challenge
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Test your knowledge acquired in the previous sections of this course by building an Amazon scraper using Crawlee's CheerioCrawler!**
@@ -73,7 +73,6 @@ In the end, we'd like our final output to look something like this:
"...": "..."
}
]
-
```
> The `asin` is the ID of the product, which is data present on the Amazon website.
diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/initializing_and_setting_up.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/initializing_and_setting_up.md
index 89e8eaca61..4ebfb9ca7e 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/challenge/initializing_and_setting_up.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/initializing_and_setting_up.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/challenge/initializing-and-setting-up
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**When you extract links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need.**
@@ -45,14 +45,16 @@ const crawler = new CheerioCrawler({
});
log.info('Starting the crawl.');
-await crawler.run([{
- // Turn the keyword into a link we can make a request with
- url: `https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
- label: 'START',
- userData: {
- keyword,
+await crawler.run([
+ {
+ // Turn the keyword into a link we can make a request with
+ url: `https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`,
+ label: 'START',
+ userData: {
+ keyword,
+ },
},
-}]);
+]);
log.info('Crawl finished.');
```
@@ -71,11 +73,11 @@ Finally, we'll add the following input file to **INPUT.json** in the project's r
```json
{
- "keyword": "iphone"
+ "keyword": "iphone"
}
```
-> This is how we'll be inputting data into our scraper from now on. Don't worry though, from now on, we'll only need to work in the **main.js** and **routes.js** files!
+> This is how we'll be inputting data into our scraper from now on. Don't worry though, we'll only need to work in the **main.js** and **routes.js** files!
## Next up {#next}
diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/modularity.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/modularity.md
index b983a75419..c257792bf2 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/challenge/modularity.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/modularity.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/challenge/modularity
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Before you build your first web scraper with Crawlee, it is important to understand the concept of modularity in programming.**
@@ -48,20 +48,22 @@ router.addHandler('START', async ({ $, crawler, request }) => {
// scrape some data from each and to a request
// to the crawler for its page
- await crawler.addRequests([{
- url,
- label: 'PRODUCT',
- userData: {
- // Pass the scraped data about the product to the next
- // request so that it can be used there
- data: {
- title: titleElement.first().text().trim(),
- asin: element.attr('data-asin'),
- itemUrl: url,
- keyword,
+ await crawler.addRequests([
+ {
+ url,
+ label: 'PRODUCT',
+ userData: {
+ // Pass the scraped data about the product to the next
+ // request so that it can be used there
+ data: {
+ title: titleElement.first().text().trim(),
+ asin: element.attr('data-asin'),
+ itemUrl: url,
+ keyword,
+ },
},
},
- }]);
+ ]);
}
});
diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md
index 69902e83b3..7e8c843903 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/challenge/scraping-amazon
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Build your first web scraper with Crawlee. Let's extract product information from Amazon to give you an idea of what real-world scraping looks like.**
@@ -30,7 +30,6 @@ router.addHandler(labels.PRODUCT, async ({ $ }) => {
});
```
-
Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman](../../../glossary/tools/proxyman.md) to analyze requests which might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers:

@@ -66,16 +65,18 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => {
const element = $('div#productDescription');
// Add to the request queue
- await crawler.addRequests([{
- url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
- label: labels.OFFERS,
- userData: {
- data: {
- ...data,
- description: element.text().trim(),
+ await crawler.addRequests([
+ {
+ url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`,
+ label: labels.OFFERS,
+ userData: {
+ data: {
+ ...data,
+ description: element.text().trim(),
+ },
},
},
- }]);
+ ]);
});
```
@@ -95,7 +96,6 @@ router.addHandler(labels.OFFERS, async ({ $, request }) => {
sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(),
offer: element.find('.a-price .a-offscreen').text().trim(),
});
-
}
});
```
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/exporting_data.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/exporting_data.md
index abdef9e7bf..99373e01ee 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/exporting_data.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/exporting_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/exporting-data
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to export the data you scraped using Crawlee to CSV or JSON.**
@@ -101,10 +101,12 @@ const crawler = new PlaywrightCrawler({
},
});
-await crawler.addRequests([{
- url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
- label: 'start-url',
-}]);
+await crawler.addRequests([
+ {
+ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+ label: 'start-url',
+ },
+]);
await crawler.run();
await Dataset.exportToCSV('results');
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/filtering_links.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/filtering_links.md
index c75c4bbeb1..4152ce224b 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/filtering_links.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/filtering_links.md
@@ -8,7 +8,7 @@ noindex: true
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**When you extract links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need.**
@@ -85,7 +85,6 @@ $('a.product-item__title');
-
When we print all the URLs in the DevTools console, we can see that we've correctly filtered only the product detail page URLs.
```js title=DevTools
@@ -102,7 +101,6 @@ If you try this in Node.js instead of DevTools, you will not get the full URLs,

-
## Filtering with pattern-matching {#pattern-matching-filter}
Another common way to filter links (or any text, really) is by matching patterns with regular expressions.
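+
+As a small illustration of the idea (the two URLs below are pages from the Warehouse store used in this course), a regular expression can keep only the links whose path contains `/products/`:
+
+```js
+const urls = [
+  'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones',
+  'https://warehouse-theme-metal.myshopify.com/collections/sales',
+];
+
+// Keep only product detail pages by matching the /products/ path segment.
+const productUrls = urls.filter((url) => /\/products\//.test(url));
+console.log(productUrls); // only the first URL remains
+```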
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md
index d185e8b01a..a31ce4beca 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md
@@ -7,7 +7,7 @@ noindex: true
---
import Example from '!!raw-loader!roa-loader!./finding_links.js';
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn what a link looks like in HTML and how to find and extract their URLs when web scraping using both DevTools and Node.js.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/first_crawl.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/first_crawl.md
index 2e87a94b42..91908b825a 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/first_crawl.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/first_crawl.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/first-crawl
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to crawl the web using Node.js, Cheerio and an HTTP client. Extract URLs from pages and use them to visit more websites.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/headless_browser.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/headless_browser.md
index 7d9c1ad355..02c6cd2dbf 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/headless_browser.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/headless_browser.md
@@ -8,7 +8,7 @@ noindex: true
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to scrape the web with a headless browser using only a few lines of code. Chrome, Firefox, Safari, Edge - all are supported.**
@@ -70,10 +70,12 @@ const crawler = new PlaywrightCrawler({
},
});
-await crawler.addRequests([{
- url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
- label: 'start-url',
-}]);
+await crawler.addRequests([
+ {
+ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+ label: 'start-url',
+ },
+]);
await crawler.run();
```
@@ -84,7 +86,6 @@ The `parseWithCheerio` function is available even in `CheerioCrawler` and all th
:::
-
When you run the code with `node browser.js`, you'll see a browser window open and then the individual pages getting scraped, each in a new browser tab.
That's it. In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs the same as before, but uses a Chromium browser instead of plain HTTP requests. This isn't possible without Crawlee.
@@ -191,10 +192,12 @@ const crawler = new PlaywrightCrawler({
},
});
-await crawler.addRequests([{
- url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
- label: 'start-url',
-}]);
+await crawler.addRequests([
+ {
+ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+ label: 'start-url',
+ },
+]);
await crawler.run();
```
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/index.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/index.md
index 1f1e07adc6..f9dba86ce1 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/index.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/index.md
@@ -7,7 +7,7 @@ slug: /scraping-basics-javascript/legacy/crawling
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to crawl the web with your scraper. How to extract links and URLs from web pages and how to manage the collected links to visit new pages.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/pro_scraping.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/pro_scraping.md
index b969ad79a9..61eabffd53 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/pro_scraping.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/pro_scraping.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/pro-scraping
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to build scrapers quicker and get better and more robust results by using Crawlee, an open-source library for scraping in Node.js.**
@@ -89,7 +89,7 @@ You'll see "**Crawlee works!**" printed to the console. If it doesn't work, it m
## Prepare the scraper {#coding-the-scraper}
- `CheerioCrawler` automatically visits URLs, downloads HTML using **Got-Scraping**, and parses it with **Cheerio**. The benefit of this over writing the code yourself is that it automatically handles the URL queue, errors, retries, proxies, parallelizes the downloads, and much more. Overall, it removes the need to write a lot of boilerplate code.
+`CheerioCrawler` automatically visits URLs, downloads HTML using **Got-Scraping**, and parses it with **Cheerio**. The benefit of this over writing the code yourself is that it automatically handles the URL queue, errors, retries, proxies, parallelizes the downloads, and much more. Overall, it removes the need to write a lot of boilerplate code.
To create a crawler with Crawlee, you only need to provide it with a request handler - a function that gets executed for each page it visits.
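+
+As a minimal sketch of that idea (the handler body is illustrative only; the start URL is the Warehouse store used throughout this lesson):
+
+```js
+import { CheerioCrawler } from 'crawlee';
+
+const crawler = new CheerioCrawler({
+  // Runs for every page the crawler visits; $ is the page's HTML already parsed by Cheerio.
+  async requestHandler({ $, request, log }) {
+    log.info(`${request.url}: ${$('title').text().trim()}`);
+  },
+});
+
+await crawler.addRequests(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
+await crawler.run();
+```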
@@ -121,9 +121,7 @@ const crawler = new CheerioCrawler({
});
// Add the Sales category of Warehouse store to the queue of URLs.
-await crawler.addRequests([
- 'https://warehouse-theme-metal.myshopify.com/collections/sales',
-]);
+await crawler.addRequests(['https://warehouse-theme-metal.myshopify.com/collections/sales']);
await crawler.run();
```
@@ -170,12 +168,14 @@ const crawler = new CheerioCrawler({
// Instead of using a string with URL, we're now
// using a request object to add more options.
-await crawler.addRequests([{
- url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
- // We label the Request to identify
- // it later in the requestHandler.
- label: 'start-url',
-}]);
+await crawler.addRequests([
+ {
+ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+ // We label the Request to identify
+ // it later in the requestHandler.
+ label: 'start-url',
+ },
+]);
await crawler.run();
```
@@ -226,10 +226,12 @@ const crawler = new CheerioCrawler({
},
});
-await crawler.addRequests([{
- url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
- label: 'start-url',
-}]);
+await crawler.addRequests([
+ {
+ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+ label: 'start-url',
+ },
+]);
await crawler.run();
```
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/recap_extraction_basics.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/recap_extraction_basics.md
index 79b54786fe..b3f13556dc 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/recap_extraction_basics.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/recap_extraction_basics.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/recap-extraction-basics
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/relative_urls.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/relative_urls.md
index 290a1cea62..4970cd5562 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/relative_urls.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/relative_urls.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/relative-urls
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn about absolute and relative URLs used on web pages and how to work with them when parsing HTML with Cheerio in your scraper.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/scraping_the_data.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/scraping_the_data.md
index 6fe51899f1..03aa58c7aa 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/crawling/scraping_the_data.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/scraping_the_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/crawling/scraping-the-data
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to add data extraction logic to your crawler, which will allow you to extract data from all the websites you crawled.**
@@ -26,7 +26,8 @@ Let's start writing a script that extracts data from this single PDP. We can use
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';
-const productUrl = 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones';
+const productUrl =
+ 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones';
const response = await gotScraping(productUrl);
const html = response.body;
@@ -97,7 +98,8 @@ Save it into a new file called `product.js` and run it with `node product.js` to
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';
-const productUrl = 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones';
+const productUrl =
+ 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones';
const response = await gotScraping(productUrl);
const html = response.body;
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md
index 805d101fe5..c5065c6312 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/browser_devtools.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/browser-devtools
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn about browser DevTools, a valuable tool in the world of web scraping, and how you can use them to extract data from a website.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/computer_preparation.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/computer_preparation.md
index 58c543ab3a..0484ba2577 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/computer_preparation.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/computer_preparation.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/computer-preparation
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Set up your computer to be able to code scrapers with Node.js and JavaScript. Download Node.js and npm and run a Hello World script.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/devtools_continued.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/devtools_continued.md
index 5fe45db672..83270eb02a 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/devtools_continued.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/devtools_continued.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/devtools-continued
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Continue learning how to extract data from a website using browser DevTools, CSS selectors, and JavaScript via the DevTools console.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/index.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/index.md
index 58923ff208..860ab0a085 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/index.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/index.md
@@ -7,7 +7,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn about HTML, CSS, and JavaScript, the basic building blocks of a website, and how to use them in web scraping and data extraction.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_continued.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_continued.md
index 49546376a2..6a0b58336c 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_continued.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_continued.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/node-continued
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Continue learning how to create a web scraper with Node.js and Cheerio. Learn how to parse HTML and print the results of the data your scraper has collected.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_js_scraper.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_js_scraper.md
index 4be391c3bf..2adbb56111 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_js_scraper.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/node_js_scraper.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/node-js-scraper
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to use JavaScript and Node.js to create a web scraper, plus take advantage of the Cheerio and Got-scraping libraries to make your job easier.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/project_setup.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/project_setup.md
index e9856dce84..1acfc8d70a 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/project_setup.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/project_setup.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/project-setup
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Create a new project with npm and Node.js. Install necessary libraries, and test that everything works before starting the next lesson.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/save_to_csv.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/save_to_csv.md
index acb2bf0536..67c818c1cf 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/save_to_csv.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/save_to_csv.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/save-to-csv
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to save the results of your scraper's collected data to a CSV file that can be opened in Excel, Google Sheets, or any other spreadsheets program.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md
index d3d56c1c28..acd8140eb1 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/data_extraction/using_devtools.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript/legacy/data-extraction/using-devtools
noindex: true
---
-import LegacyAdmonition from '../../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../../scraping_basics/\_legacy.mdx';
**Learn how to use browser DevTools, CSS selectors, and JavaScript via the DevTools console to extract data from a website.**
diff --git a/sources/academy/webscraping/scraping_basics_legacy/index.md b/sources/academy/webscraping/scraping_basics_legacy/index.md
index 5c52b1ae54..10d9a3ada6 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/index.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/index.md
@@ -8,7 +8,7 @@ slug: /scraping-basics-javascript/legacy
noindex: true
---
-import LegacyAdmonition from '../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../scraping_basics/\_legacy.mdx';
**Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.**
@@ -36,10 +36,10 @@ When we set out to create the Academy, we wanted to build a complete guide to we
This is what you'll learn in the **Web scraping basics for JavaScript devs** course:
-* [Web scraping basics for JavaScript devs](./index.md)
- * [Basics of data extraction](./data_extraction/index.md)
- * [Basics of crawling](./crawling/index.md)
- * [Best practices](./best_practices.md)
+- [Web scraping basics for JavaScript devs](./index.md)
+ - [Basics of data extraction](./data_extraction/index.md)
+ - [Basics of crawling](./crawling/index.md)
+ - [Best practices](./best_practices.md)
## Requirements {#requirements}
@@ -55,17 +55,17 @@ Ideally, you should have at least a moderate understanding of the following conc
It is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section:
-* [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)
-* [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)
-* [Modularity in Node.js](https://javascript.plainenglish.io/how-to-use-modular-patterns-in-nodejs-982f0e5c8f6e)
+- [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE&ab_channel=Fireship)
+- [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration)
+- [Modularity in Node.js](https://javascript.plainenglish.io/how-to-use-modular-patterns-in-nodejs-982f0e5c8f6e)
### General web development {#general-web-development}
Throughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because their knowledge will be **assumed** (unless we're showing something out of the ordinary).
-* [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)
-* [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)
-* [DevTools](./data_extraction/browser_devtools.md)
+- [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML)
+- [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP)
+- [DevTools](./data_extraction/browser_devtools.md)
### jQuery or Cheerio {#jquery-or-cheerio}
diff --git a/sources/academy/webscraping/scraping_basics_legacy/introduction.md b/sources/academy/webscraping/scraping_basics_legacy/introduction.md
index 68bace5f50..ae4fccfddd 100644
--- a/sources/academy/webscraping/scraping_basics_legacy/introduction.md
+++ b/sources/academy/webscraping/scraping_basics_legacy/introduction.md
@@ -7,7 +7,7 @@ slug: /scraping-basics-javascript/legacy/introduction
noindex: true
---
-import LegacyAdmonition from '../scraping_basics/_legacy.mdx';
+import LegacyAdmonition from '../scraping_basics/\_legacy.mdx';
**Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper.**
diff --git a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
index 9a06641d28..bca41b5a07 100644
--- a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
+++ b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
@@ -1,11 +1,11 @@
---
title: Inspecting web pages with browser DevTools
-sidebar_label: "DevTools: Inspecting"
+sidebar_label: 'DevTools: Inspecting'
description: Lesson about using the browser tools for developers to inspect and manipulate the structure of a website.
slug: /scraping-basics-python/devtools-inspecting
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**
@@ -43,8 +43,8 @@ Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as
```html
-
First Level Heading
-
Paragraph with emphasized text.
+
First Level Heading
+
Paragraph with emphasized text.
```
@@ -52,8 +52,8 @@ HTML, a markup language, describes how everything on a page is organized, how el
```css
.heading {
- color: blue;
- text-transform: uppercase;
+ color: blue;
+ text-transform: uppercase;
}
```
@@ -76,9 +76,7 @@ We'll click the icon and hover your cursor over Wikipedia's subtitle, **The Free
The highlighted section should look something like this:
```html
-
- The Free Encyclopedia
-
+ The Free Encyclopedia
```
If we were experienced creators of scrapers, our eyes would immediately spot what's needed to make a program that fetches Wikipedia's subtitle. The program would need to download the page's source code, find a `strong` element with `localized-slogan` in its `class` attribute, and extract its text.
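+
+A rough sketch of that last step, runnable in the DevTools **Console** (the selector is derived from the element highlighted above):
+
+```js
+// Find the strong element whose class attribute contains localized-slogan and read its text.
+document.querySelector('strong.localized-slogan').textContent.trim();
+```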
@@ -88,15 +86,11 @@ If we were experienced creators of scrapers, our eyes would immediately spot wha
In HTML, whitespace isn't significant, i.e., it only makes the code readable. The following code snippets are equivalent:
```html
-
- The Free Encyclopedia
-
+ The Free Encyclopedia
```
```html
- The Free
-Encyclopedia
-
+The Free Encyclopedia
```
:::
@@ -154,13 +148,13 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
Solution
- 1. Go to [fifa.com](https://www.fifa.com/).
- 1. Activate the element selection tool.
- 1. Click on the logo.
- 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
- 1. In the console, type `temp1.src` and hit **Enter**.
+1. Go to [fifa.com](https://www.fifa.com/).
+1. Activate the element selection tool.
+1. Click on the logo.
+1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
+1. In the console, type `temp1.src` and hit **Enter**.
- 
+
@@ -171,12 +165,12 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t
Solution
- 1. Go to [cnn.com](https://cnn.com).
- 1. Activate the element selection tool.
- 1. Click on a heading.
- 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
- 1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.
+1. Go to [cnn.com](https://cnn.com).
+1. Activate the element selection tool.
+1. Click on a heading.
+1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
+1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.
- 
+
diff --git a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
index 515cf1f5e1..3a2c30d6d0 100644
--- a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
@@ -1,11 +1,11 @@
---
title: Locating HTML elements on a web page with browser DevTools
-sidebar_label: "DevTools: Locating HTML elements"
+sidebar_label: 'DevTools: Locating HTML elements'
description: Lesson about using the browser tools for developers to manually find products on an e-commerce website.
slug: /scraping-basics-python/devtools-locating-elements
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**
@@ -46,9 +46,7 @@ At this stage, we could use the **Store as global variable** option to send the
Scrapers typically rely on [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors) to locate elements on a page, and these selectors often target elements based on their `class` attributes. The product card we highlighted has markup like this:
```html
-
- ...
-
+
...
```
The `class` attribute can hold multiple values separated by whitespace. This particular element has four classes. Let's move to the **Console** and experiment with CSS selectors to locate this element.
@@ -73,9 +71,9 @@ The [type selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_select
```html
-
-
Title
-
Paragraph.
+
+
Title
+
Paragraph.
```
@@ -83,14 +81,14 @@ The [class selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_sele
```html
-
Title
-
-
Subtitle
-
Paragraph
-
+
Title
- Heading
-
+
Subtitle
+
Paragraph
+
+
+ Heading
+
```
@@ -98,10 +96,10 @@ You can combine selectors to narrow results. For example, `p.lead` matches `p` e
```html
-
-
Lead paragraph.
-
Paragraph
-
Paragraph
+
+
Lead paragraph.
+
Paragraph
+
Paragraph
```
@@ -111,7 +109,7 @@ How did we know `.product-item` selects a product card? By inspecting the markup
Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.
-The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
+The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card _is_ a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules.
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
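+
+To get a feel for the difference, we can compare a resilient selector with a fragile one in the **Console** (a sketch only; the exact numbers will vary):
+
+```js
+// Resilient: tied to what the element means.
+document.querySelectorAll('.product-item').length;
+
+// Fragile: a randomly generated class like the one mentioned above is likely to change without warning.
+document.querySelectorAll('.F4jsL8').length;
+```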
@@ -157,12 +155,12 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
Solution
- 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on several headings to examine the markup.
- 1. Notice that all headings are `h2` elements with the `mp-h2` class.
- 1. In the **Console**, execute `document.querySelectorAll('h2')`.
- 1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
+1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
+1. Activate the element selection tool in your DevTools.
+1. Click on several headings to examine the markup.
+1. Notice that all headings are `h2` elements with the `mp-h2` class.
+1. In the **Console**, execute `document.querySelectorAll('h2')`.
+1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.
@@ -175,13 +173,13 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
Solution
- 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the first product to inspect its markup. Repeat with a few others.
- 1. Observe that all products are `section` elements with multiple classes, including `product-card`.
- 1. Since `section` is a generic wrapper, focus on the `product-card` class.
- 1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
- 1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
+1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
+1. Activate the element selection tool in your DevTools.
+1. Click on the first product to inspect its markup. Repeat with a few others.
+1. Observe that all products are `section` elements with multiple classes, including `product-card`.
+1. Since `section` is a generic wrapper, focus on the `product-card` class.
+1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
+1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
@@ -200,13 +198,13 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs
Solution
- 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
- 1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
- 1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
- 1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
- 1. In the **Console**, execute `document.querySelectorAll('main li')`.
- 1. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works!
+1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
+1. Activate the element selection tool in your DevTools.
+1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
+1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
+1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
+1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
+1. In the **Console**, execute `document.querySelectorAll('main li')`.
+1. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works!
diff --git a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
index f864362f8a..f763b6eed0 100644
--- a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
@@ -1,11 +1,11 @@
---
title: Extracting data from a web page with browser DevTools
-sidebar_label: "DevTools: Extracting data"
+sidebar_label: 'DevTools: Extracting data'
description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
slug: /scraping-basics-python/devtools-extracting-data
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**
@@ -83,15 +83,15 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
Solution
- 1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
- 1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the price of the first and most expensive plant.
- 1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
- 1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
- 1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
- 1. Convert the price text into a number by executing `parseInt(price.textContent)`.
- 1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
+1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
+1. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
+1. Activate the element selection tool in your DevTools.
+1. Click on the price of the first and most expensive plant.
+1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
+1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
+1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
+1. Convert the price text into a number by executing `parseInt(price.textContent)`.
+1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
@@ -104,13 +104,13 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto
Solution
- 1. Open the [Movies page](https://www.fandom.com/topics/movies).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the list item for the top Fandom wiki in the category.
- 1. Notice that it has a class `topic_explore-wikis__link`.
- 1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done.
- 1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`.
- 1. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`.
+1. Open the [Movies page](https://www.fandom.com/topics/movies).
+1. Activate the element selection tool in your DevTools.
+1. Click on the list item for the top Fandom wiki in the category.
+1. Notice that it has a class `topic_explore-wikis__link`.
+1. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. The page uses this class only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done.
+1. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`.
+1. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`.
@@ -123,13 +123,13 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
Solution
- 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
- 1. Activate the element selection tool in your DevTools.
- 1. Click on the first post.
- 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
- 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
- 1. Extract the post's title by executing `post.querySelector('h3').textContent`.
- 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
- 1. Extract the photo URL by executing `post.querySelector('img').src`.
+1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
+1. Activate the element selection tool in your DevTools.
+1. Click on the first post.
+1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
+1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
+1. Extract the post's title by executing `post.querySelector('h3').textContent`.
+1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
+1. Extract the photo URL by executing `post.querySelector('img').src`.
diff --git a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
index e3866cfcb2..04a0d6bb99 100644
--- a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
+++ b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/downloading-html
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download HTML code of a product listing page.**
@@ -150,14 +150,14 @@ https://www.aliexpress.com/w/wholesale-darth-vader.html
Solution
- ```py
- import httpx
+```py
+import httpx
- url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
- response = httpx.get(url)
- response.raise_for_status()
- print(response.text)
- ```
+url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"
+response = httpx.get(url)
+response.raise_for_status()
+print(response.text)
+```
@@ -172,23 +172,23 @@ https://warehouse-theme-metal.myshopify.com/collections/sales
Solution
- Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
+Right in your Terminal or Command Prompt, you can create files by _redirecting output_ of command line programs:
- ```text
- python main.py > products.html
- ```
+```text
+python main.py > products.html
+```
- If you want to use Python instead, it offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
+If you want to use Python instead, it offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
- ```py
- import httpx
- from pathlib import Path
+```py
+import httpx
+from pathlib import Path
- url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
- response = httpx.get(url)
- response.raise_for_status()
- Path("products.html").write_text(response.text)
- ```
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+Path("products.html").write_text(response.text)
+```
@@ -203,16 +203,16 @@ https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72
Solution
- Python offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
+Python offers several ways to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html):
- ```py
- from pathlib import Path
- import httpx
+```py
+from pathlib import Path
+import httpx
- url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
- response = httpx.get(url)
- response.raise_for_status()
- Path("tv.jpg").write_bytes(response.content)
- ```
+url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"
+response = httpx.get(url)
+response.raise_for_status()
+Path("tv.jpg").write_bytes(response.content)
+```
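The two solutions above differ in one detail worth calling out: the HTML is saved as text, the image as bytes. A minimal sketch of that distinction with HTTPX and pathlib (same URL as the first exercise, assumed to still respond):

```py
import httpx
from pathlib import Path

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

# response.text is the body decoded to str - fine for HTML, JSON, CSV.
Path("page.html").write_text(response.text)

# response.content is the raw bytes - required for images, PDFs, and other binaries.
Path("page.raw").write_bytes(response.content)
```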
diff --git a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
index dbfa52cb9a..080b85fa2d 100644
--- a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/parsing-html
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.**
@@ -131,18 +131,18 @@ https://www.f1academy.com/Racing-Series/Teams
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+```py
+import httpx
+from bs4 import BeautifulSoup
- url = "https://www.f1academy.com/Racing-Series/Teams"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://www.f1academy.com/Racing-Series/Teams"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".teams-driver-item")))
- ```
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".teams-driver-item")))
+```
@@ -153,17 +153,17 @@ Use the same URL as in the previous exercise, but this time print a total count
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+```py
+import httpx
+from bs4 import BeautifulSoup
- url = "https://www.f1academy.com/Racing-Series/Teams"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://www.f1academy.com/Racing-Series/Teams"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
- print(len(soup.select(".driver")))
- ```
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+print(len(soup.select(".driver")))
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
index 0708dc071e..f5cf5652b0 100644
--- a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/locating-elements
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll locate product data in the downloaded HTML. We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price.**
@@ -145,8 +145,8 @@ For each product, our scraper also prints the text `Sale price`. Let's look at t
```html
- Sale price
- $74.95
+ Sale price
+ $74.95
```
@@ -244,27 +244,27 @@ Djibouti
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+```py
+import httpx
+from bs4 import BeautifulSoup
- url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for table in soup.select(".wikitable"):
- for row in table.select("tr"):
- cells = row.select("td")
- if cells:
- third_column = cells[2]
- title_link = third_column.select_one("a")
- print(title_link.text)
- ```
+for table in soup.select(".wikitable"):
+ for row in table.select("tr"):
+ cells = row.select("td")
+ if cells:
+ third_column = cells[2]
+ title_link = third_column.select_one("a")
+ print(title_link.text)
+```
- Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `table_row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
+Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells.
@@ -284,20 +284,20 @@ You may want to check out the following pages:
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+```py
+import httpx
+from bs4 import BeautifulSoup
- url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
- print(name_cell.select_one("a").text)
- ```
+for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
+ print(name_cell.select_one("a").text)
+```
@@ -321,19 +321,19 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
+```py
+import httpx
+from bs4 import BeautifulSoup
- url = "https://www.theguardian.com/sport/formulaone"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://www.theguardian.com/sport/formulaone"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for title in soup.select("#maincontent ul li h3"):
- print(title.text)
- ```
+for title in soup.select("#maincontent ul li h3"):
+ print(title.text)
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
index eb49b7ce69..668be21893 100644
--- a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/extracting-data
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.**
@@ -86,7 +86,7 @@ for product in soup.select(".product-item"):
## Removing white space
-Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
+Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) of the HTML tags.
We call the operation of removing whitespace _stripping_ or _trimming_, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [.strip()](https://docs.python.org/3/library/stdtypes.html#str.strip):
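A quick, standalone illustration of what `.strip()` does, assuming a title padded by whitespace from HTML indentation:

```py
# Text extracted from indented HTML typically carries leading and trailing whitespace.
title = "\n          JBL Flip 4 Waterproof Portable Bluetooth Speaker\n        "
print(repr(title.strip()))  # 'JBL Flip 4 Waterproof Portable Bluetooth Speaker'
```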
@@ -241,37 +241,37 @@ Denon AH-C720 In-Ear Headphones | 236
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
-
- url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
- response = httpx.get(url)
- response.raise_for_status()
-
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
-
- for product in soup.select(".product-item"):
- title = product.select_one(".product-item__title").text.strip()
-
- units_text = (
- product
- .select_one(".product-item__inventory")
- .text
- .removeprefix("In stock,")
- .removeprefix("Only")
- .removesuffix(" left")
- .removesuffix("units")
- .strip()
- )
- if "Sold out" in units_text:
- units = 0
- else:
- units = int(units_text)
-
- print(title, units, sep=" | ")
- ```
+```py
+import httpx
+from bs4 import BeautifulSoup
+
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
+
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
+
+for product in soup.select(".product-item"):
+ title = product.select_one(".product-item__title").text.strip()
+
+ units_text = (
+ product
+ .select_one(".product-item__inventory")
+ .text
+ .removeprefix("In stock,")
+ .removeprefix("Only")
+ .removesuffix(" left")
+ .removesuffix("units")
+ .strip()
+ )
+ if "Sold out" in units_text:
+ units = 0
+ else:
+ units = int(units_text)
+
+ print(title, units, sep=" | ")
+```
@@ -282,29 +282,29 @@ Simplify the code from previous exercise. Use [regular expressions](https://docs
Solution
- ```py
- import re
- import httpx
- from bs4 import BeautifulSoup
+```py
+import re
+import httpx
+from bs4 import BeautifulSoup
- url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for product in soup.select(".product-item"):
- title = product.select_one(".product-item__title").text.strip()
+for product in soup.select(".product-item"):
+ title = product.select_one(".product-item__title").text.strip()
- units_text = product.select_one(".product-item__inventory").text
- if re_match := re.search(r"\d+", units_text):
- units = int(re_match.group())
- else:
- units = 0
+ units_text = product.select_one(".product-item__inventory").text
+ if re_match := re.search(r"\d+", units_text):
+ units = int(re_match.group())
+ else:
+ units = 0
- print(title, units, sep=" | ")
- ```
+ print(title, units, sep=" | ")
+```
@@ -338,25 +338,25 @@ Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from datetime import datetime
+```py
+import httpx
+from bs4 import BeautifulSoup
+from datetime import datetime
- url = "https://www.theguardian.com/sport/formulaone"
- response = httpx.get(url)
- response.raise_for_status()
+url = "https://www.theguardian.com/sport/formulaone"
+response = httpx.get(url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for article in soup.select("#maincontent ul li"):
- title = article.select_one("h3").text.strip()
+for article in soup.select("#maincontent ul li"):
+ title = article.select_one("h3").text.strip()
- date_iso = article.select_one("time")["datetime"].strip()
- date = datetime.fromisoformat(date_iso)
+ date_iso = article.select_one("time")["datetime"].strip()
+ date = datetime.fromisoformat(date_iso)
- print(title, date.strftime('%a %b %d %Y'), sep=" | ")
- ```
+ print(title, date.strftime('%a %b %d %Y'), sep=" | ")
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
index a0d6d94743..45ae029eee 100644
--- a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
@@ -101,6 +101,7 @@ with open("products.json", "w") as file:
That's it! If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
+
```json title=products.json
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "7495", "price": "7495"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "139800", "price": null}, ...]
```
@@ -108,7 +109,11 @@ That's it! If we run our scraper now, it won't display any output, but it will c
If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
```json
-{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "15800", "price": "15800"}
+{
+ "title": "Sony SACS9 10\" Active Subwoofer",
+ "min_price": "15800",
+ "price": "15800"
+}
```
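A tiny sketch (using the product from the excerpt above) of how the standard `json` module produces that escaping on its own:

```py
import json

item = {"title": 'Sony SACS9 10" Active Subwoofer', "min_price": "15800", "price": "15800"}

# json.dumps()/json.dump() escape the inner double quote so the output stays valid JSON.
print(json.dumps(item))
# {"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "15800", "price": "15800"}
```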
:::tip Pretty JSON
@@ -191,17 +196,17 @@ Write a new Python program that reads the `products.json` file we created in thi
Solution
- ```py
- import json
- from pprint import pp
+```py
+import json
+from pprint import pp
- with open("products.json", "r") as file:
- products = json.load(file)
+with open("products.json", "r") as file:
+ products = json.load(file)
- for product in products:
- if int(product["min_price"]) > 500:
- pp(product)
- ```
+for product in products:
+ if int(product["min_price"]) > 500:
+ pp(product)
+```
@@ -212,12 +217,12 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
Solution
- Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
+Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:
- 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
- 1. Select the header row. Go to **Data > Create filter**.
- 1. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
+1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
+1. Select the header row. Go to **Data > Create filter**.
+1. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
- 
+
diff --git a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
index 883ba050f3..02ddd28c6c 100644
--- a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/getting-links
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson, we'll locate and extract links to individual product pages. We'll use BeautifulSoup to find the relevant bits of HTML.**
@@ -236,6 +236,7 @@ def parse_product(product):
In the previous code example, we've also added the URL to the dictionary returned by the function. If we run the scraper now, it should produce exports where each product contains a link to its product page:
+
```json title=products.json
[
{
@@ -300,6 +301,7 @@ for product in listing_soup.select(".product-item"):
When we run the scraper now, we should see full URLs in our exports:
+
```json title=products.json
[
{
@@ -345,23 +347,23 @@ https://en.wikipedia.org/wiki/Botswana
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
- listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- response = httpx.get(listing_url)
- response.raise_for_status()
+listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+response = httpx.get(listing_url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
- link = name_cell.select_one("a")
- url = urljoin(listing_url, link["href"])
- print(url)
- ```
+for name_cell in soup.select(".wikitable tr td:nth-child(3)"):
+ link = name_cell.select_one("a")
+ url = urljoin(listing_url, link["href"])
+ print(url)
+```
@@ -386,29 +388,29 @@ https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-u
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
- listing_url = "https://www.theguardian.com/sport/formulaone"
- response = httpx.get(listing_url)
- response.raise_for_status()
+listing_url = "https://www.theguardian.com/sport/formulaone"
+response = httpx.get(listing_url)
+response.raise_for_status()
- html_code = response.text
- soup = BeautifulSoup(html_code, "html.parser")
+html_code = response.text
+soup = BeautifulSoup(html_code, "html.parser")
- for item in soup.select("#maincontent ul li"):
- link = item.select_one("a")
- url = urljoin(listing_url, link["href"])
- print(url)
- ```
+for item in soup.select("#maincontent ul li"):
+ link = item.select_one("a")
+ url = urljoin(listing_url, link["href"])
+ print(url)
+```
- Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this:
+Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this:
- ```text
- https://www.theguardian.com/sport/article/2024/sep/02/example
- https://www.theguardian.com/sport/article/2024/sep/02/example#comments
- ```
+```text
+https://www.theguardian.com/sport/article/2024/sep/02/example
+https://www.theguardian.com/sport/article/2024/sep/02/example#comments
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/10_crawling.md b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
index 836dadad3a..8164e943b4 100644
--- a/sources/academy/webscraping/scraping_basics_python/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/crawling
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.**
@@ -87,33 +87,27 @@ Depending on what's valuable for our use case, we can now use the same technique
```html
```
@@ -146,6 +140,7 @@ for product in listing_soup.select(".product-item"):
If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name:
+
```json title=products.json
[
{
@@ -210,32 +205,32 @@ Locating cells in tables is sometimes easier if you know how to [navigate up](ht
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- def parse_calling_code(soup):
- for label in soup.select("th.infobox-label"):
- if label.text.strip() == "Calling code":
- data = label.parent.select_one("td.infobox-data")
- return data.text.strip()
- return None
-
- listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
- listing_soup = download(listing_url)
- for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
- link = name_cell.select_one("a")
- country_url = urljoin(listing_url, link["href"])
- country_soup = download(country_url)
- calling_code = parse_calling_code(country_soup)
- print(country_url, calling_code)
- ```
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+
+def download(url):
+ response = httpx.get(url)
+ response.raise_for_status()
+ return BeautifulSoup(response.text, "html.parser")
+
+def parse_calling_code(soup):
+ for label in soup.select("th.infobox-label"):
+ if label.text.strip() == "Calling code":
+ data = label.parent.select_one("td.infobox-data")
+ return data.text.strip()
+ return None
+
+listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"
+listing_soup = download(listing_url)
+for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"):
+ link = name_cell.select_one("a")
+ country_url = urljoin(listing_url, link["href"])
+ country_soup = download(country_url)
+ calling_code = parse_calling_code(country_soup)
+ print(country_url, calling_code)
+```
@@ -268,34 +263,34 @@ PA Media: Lewis Hamilton reveals lifelong battle with depression after school bu
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- def parse_author(article_soup):
- link = article_soup.select_one('a[rel="author"]')
- if link:
- return link.text.strip()
- address = article_soup.select_one('aside address')
- if address:
- return address.text.strip()
- return None
-
- listing_url = "https://www.theguardian.com/sport/formulaone"
- listing_soup = download(listing_url)
- for item in listing_soup.select("#maincontent ul li"):
- link = item.select_one("a")
- article_url = urljoin(listing_url, link["href"])
- article_soup = download(article_url)
- title = article_soup.select_one("h1").text.strip()
- author = parse_author(article_soup)
- print(f"{author}: {title}")
- ```
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+
+def download(url):
+ response = httpx.get(url)
+ response.raise_for_status()
+ return BeautifulSoup(response.text, "html.parser")
+
+def parse_author(article_soup):
+ link = article_soup.select_one('a[rel="author"]')
+ if link:
+ return link.text.strip()
+ address = article_soup.select_one('aside address')
+ if address:
+ return address.text.strip()
+ return None
+
+listing_url = "https://www.theguardian.com/sport/formulaone"
+listing_soup = download(listing_url)
+for item in listing_soup.select("#maincontent ul li"):
+ link = item.select_one("a")
+ article_url = urljoin(listing_url, link["href"])
+ article_soup = download(article_url)
+ title = article_soup.select_one("h1").text.strip()
+ author = parse_author(article_soup)
+ print(f"{author}: {title}")
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
index e47affbaec..7b3f23088d 100644
--- a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/scraping-variants
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.**
@@ -19,20 +19,43 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G
```html
-
-
-
-
-
-
-
-
+
+
+
+
+
+
+
+
```
@@ -46,21 +69,23 @@ After a bit of detective work, we notice that not far below the `block-swatch-li
```html
-
-
-
-
+
+
+
+
```
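Merging the listing data with each variant relies on the `|` dictionary operator (Python 3.9+), mentioned just below. A minimal sketch of that operator, with made-up data shaped like the lesson's items:

```py
# Made-up data shaped like the lesson's items.
product = {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "139800", "price": None}
variant = {"variant_name": "55\"", "price": "139800"}

# `|` returns a new dict; on duplicate keys, the right-hand operand wins.
item = product | variant
print(item)
# {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': '139800', 'price': '139800', 'variant_name': '55"'}
```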
@@ -103,6 +128,7 @@ Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https
If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page.
+
```json title=products.json
[
...
@@ -121,6 +147,7 @@ If we run the program now, we'll see 34 items in total. Some items don't have va
Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items.
+
```json title=products.json
[
...
@@ -147,6 +174,7 @@ Some products will break into several items, each with a different variant name.
Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like _$74.95_, instead of _from $74.95_.
+
```json title=products.json
[
...
@@ -272,6 +300,7 @@ with open("products.csv", "w") as file:
Let's run the scraper and see if all the items in the data contain prices:
+
```json title=products.json
[
...
@@ -340,35 +369,35 @@ You can find everything you need for working with dates and times in Python's [`
Solution
- After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually.
-
- ```py
- from pprint import pp
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
- from datetime import datetime, date, timedelta
-
- today = date.today()
- jobs_url = "https://www.python.org/jobs/type/database/"
- response = httpx.get(jobs_url)
- response.raise_for_status()
- soup = BeautifulSoup(response.text, "html.parser")
-
- for job in soup.select(".list-recent-jobs li"):
- link = job.select_one(".listing-company-name a")
+After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually.
- time = job.select_one(".listing-posted time")
- posted_at = datetime.fromisoformat(time["datetime"])
- posted_on = posted_at.date()
- posted_ago = today - posted_on
-
- if posted_ago <= timedelta(days=60):
- title = link.text.strip()
- company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
- url = urljoin(jobs_url, link["href"])
- pp({"title": title, "company": company, "url": url, "posted_on": posted_on})
- ```
+```py
+from pprint import pp
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+from datetime import datetime, date, timedelta
+
+today = date.today()
+jobs_url = "https://www.python.org/jobs/type/database/"
+response = httpx.get(jobs_url)
+response.raise_for_status()
+soup = BeautifulSoup(response.text, "html.parser")
+
+for job in soup.select(".list-recent-jobs li"):
+ link = job.select_one(".listing-company-name a")
+
+ time = job.select_one(".listing-posted time")
+ posted_at = datetime.fromisoformat(time["datetime"])
+ posted_on = posted_at.date()
+ posted_ago = today - posted_on
+
+ if posted_ago <= timedelta(days=60):
+ title = link.text.strip()
+ company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
+ url = urljoin(jobs_url, link["href"])
+ pp({"title": title, "company": company, "url": url, "posted_on": posted_on})
+```
@@ -387,32 +416,32 @@ At the time of writing, the shortest article on the CNN Sports homepage is [abou
Solution
- ```py
- import httpx
- from bs4 import BeautifulSoup
- from urllib.parse import urljoin
-
- def download(url):
- response = httpx.get(url)
- response.raise_for_status()
- return BeautifulSoup(response.text, "html.parser")
-
- listing_url = "https://edition.cnn.com/sport"
- listing_soup = download(listing_url)
-
- data = []
- for card in listing_soup.select(".layout__main .card"):
- link = card.select_one(".container__link")
- article_url = urljoin(listing_url, link["href"])
- article_soup = download(article_url)
- if content := article_soup.select_one(".article__content"):
- length = len(content.get_text())
- data.append((length, article_url))
-
- data.sort()
- shortest_item = data[0]
- item_url = shortest_item[1]
- print(item_url)
- ```
+```py
+import httpx
+from bs4 import BeautifulSoup
+from urllib.parse import urljoin
+
+def download(url):
+ response = httpx.get(url)
+ response.raise_for_status()
+ return BeautifulSoup(response.text, "html.parser")
+
+listing_url = "https://edition.cnn.com/sport"
+listing_soup = download(listing_url)
+
+data = []
+for card in listing_soup.select(".layout__main .card"):
+ link = card.select_one(".container__link")
+ article_url = urljoin(listing_url, link["href"])
+ article_soup = download(article_url)
+ if content := article_soup.select_one(".article__content"):
+ length = len(content.get_text())
+ data.append((length, article_url))
+
+data.sort()
+shortest_item = data[0]
+item_url = shortest_item[1]
+print(item_url)
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md
index 6f8861785d..7d4feb528f 100644
--- a/sources/academy/webscraping/scraping_basics_python/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/framework
---
-import Exercises from '../scraping_basics/_exercises.mdx';
+import Exercises from '../scraping_basics/\_exercises.mdx';
**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**
@@ -431,6 +431,7 @@ Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Acade
If you export the dataset as JSON, it should look something like this:
+
```json
[
{
@@ -463,48 +464,48 @@ If you export the dataset as JSON, it should look something like this:
Solution
- ```py
- import asyncio
- from datetime import datetime
-
- from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
- async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_listing(context: BeautifulSoupCrawlingContext):
- await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")
-
- @crawler.router.handler("DRIVER")
- async def handle_driver(context: BeautifulSoupCrawlingContext):
- info = {}
- for row in context.soup.select(".common-driver-info li"):
- name = row.select_one("span").text.strip()
- value = row.select_one("h4").text.strip()
- info[name] = value
-
- detail = {}
- for row in context.soup.select(".driver-detail--cta-group a"):
- name = row.select_one("p").text.strip()
- value = row.select_one("h2").text.strip()
- detail[name] = value
-
- await context.push_data({
- "url": context.request.url,
- "name": context.soup.select_one("h1").text.strip(),
- "team": detail["Team"],
- "nationality": info["Nationality"],
- "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(),
- "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"),
- })
-
- await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"])
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
-
- if __name__ == '__main__':
- asyncio.run(main())
- ```
+```py
+import asyncio
+from datetime import datetime
+
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+
+async def main():
+ crawler = BeautifulSoupCrawler()
+
+ @crawler.router.default_handler
+ async def handle_listing(context: BeautifulSoupCrawlingContext):
+ await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")
+
+ @crawler.router.handler("DRIVER")
+ async def handle_driver(context: BeautifulSoupCrawlingContext):
+ info = {}
+ for row in context.soup.select(".common-driver-info li"):
+ name = row.select_one("span").text.strip()
+ value = row.select_one("h4").text.strip()
+ info[name] = value
+
+ detail = {}
+ for row in context.soup.select(".driver-detail--cta-group a"):
+ name = row.select_one("p").text.strip()
+ value = row.select_one("h2").text.strip()
+ detail[name] = value
+
+ await context.push_data({
+ "url": context.request.url,
+ "name": context.soup.select_one("h1").text.strip(),
+ "team": detail["Team"],
+ "nationality": info["Nationality"],
+ "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(),
+ "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"),
+ })
+
+ await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"])
+ await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
+
+if __name__ == '__main__':
+ asyncio.run(main())
+```
@@ -519,6 +520,7 @@ The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listin
If you export the dataset as JSON, it should look something like this:
+
```json
[
{
@@ -564,44 +566,44 @@ When navigating to the first IMDb search result, you might find it helpful to kn
Solution
- ```py
- import asyncio
- from urllib.parse import quote_plus
-
- from crawlee import Request
- from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
-
- async def main():
- crawler = BeautifulSoupCrawler()
-
- @crawler.router.default_handler
- async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
- requests = []
- for name_cell in context.soup.select('[data-uia="top10-table-row-title"] button'):
- name = name_cell.text.strip()
- imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
- requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
- await context.add_requests(requests)
-
- @crawler.router.handler("IMDB_SEARCH")
- async def handle_imdb_search(context: BeautifulSoupCrawlingContext):
- await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)
-
- @crawler.router.handler("IMDB")
- async def handle_imdb(context: BeautifulSoupCrawlingContext):
- rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
- rating_text = context.soup.select_one(rating_selector).text.strip()
- await context.push_data({
- "url": context.request.url,
- "title": context.soup.select_one("h1").text.strip(),
- "rating": rating_text,
- })
-
- await crawler.run(["https://www.netflix.com/tudum/top10"])
- await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
-
- if __name__ == '__main__':
- asyncio.run(main())
- ```
+```py
+import asyncio
+from urllib.parse import quote_plus
+
+from crawlee import Request
+from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
+
+async def main():
+ crawler = BeautifulSoupCrawler()
+
+ @crawler.router.default_handler
+ async def handle_netflix_table(context: BeautifulSoupCrawlingContext):
+ requests = []
+ for name_cell in context.soup.select('[data-uia="top10-table-row-title"] button'):
+ name = name_cell.text.strip()
+ imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
+ requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
+ await context.add_requests(requests)
+
+ @crawler.router.handler("IMDB_SEARCH")
+ async def handle_imdb_search(context: BeautifulSoupCrawlingContext):
+ await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)
+
+ @crawler.router.handler("IMDB")
+ async def handle_imdb(context: BeautifulSoupCrawlingContext):
+ rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
+ rating_text = context.soup.select_one(rating_selector).text.strip()
+ await context.push_data({
+ "url": context.request.url,
+ "title": context.soup.select_one("h1").text.strip(),
+ "rating": rating_text,
+ })
+
+ await crawler.run(["https://www.netflix.com/tudum/top10"])
+ await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
+
+if __name__ == '__main__':
+ asyncio.run(main())
+```
diff --git a/sources/academy/webscraping/scraping_basics_python/13_platform.md b/sources/academy/webscraping/scraping_basics_python/13_platform.md
index 7496a63661..3002decd1b 100644
--- a/sources/academy/webscraping/scraping_basics_python/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_python/13_platform.md
@@ -205,9 +205,7 @@ Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'
"title": "Start URLs",
"type": "array",
"description": "URLs to start with",
- "prefill": [
- { "url": "https://apify.com" }
- ],
+ "prefill": [{ "url": "https://apify.com" }],
"editor": "requestListSources"
}
},
diff --git a/sources/legal/latest/policies/community-code-of-conduct.md b/sources/legal/latest/policies/community-code-of-conduct.md
index 44560c4645..f6279966da 100644
--- a/sources/legal/latest/policies/community-code-of-conduct.md
+++ b/sources/legal/latest/policies/community-code-of-conduct.md
@@ -36,23 +36,23 @@ The following are not hard and fast rules, merely aids to the human judgment of
We are committed to maintaining a community where users are free to express themselves and challenge one another's ideas, both technical and otherwise. At the same time, it's important that users remain respectful and allow space for others to contribute openly. In order to foster both a safe and productive environment, we encourage our community members to look to these guidelines to inform how they interact on our platform. Below, you’ll find some suggestions for how to have successful interactions as a valued member of the Apify community.
- **Engage with consideration and respect**.
- - **Be welcoming and open-minded**. - New users join our community each day. Some are well-established developers, while others are just beginning. Be open to other ideas and experience levels. Make room for opinions other than your own and be welcoming to new collaborators and those just getting started.
- - **Be respectful**. - Working in a collaborative environment means disagreements may happen. But remember to criticize ideas, not people. Share thoughtful, constructive criticism and be courteous to those you interact with. If you’re unable to engage respectfully, consider taking a step back or using some of our moderation tools to deescalate a tense situation.
- - **Be empathetic**. - Apify is a global community with people from a wide variety of backgrounds and perspectives, many of which may not be your own. Try to put yourself in others’ shoes and understand their feelings before you address them. Do your best to help make Apify a community where others feel safe to make contributions, participate in discussions, and share different ideas.
+ - **Be welcoming and open-minded**. - New users join our community each day. Some are well-established developers, while others are just beginning. Be open to other ideas and experience levels. Make room for opinions other than your own and be welcoming to new collaborators and those just getting started.
+ - **Be respectful**. - Working in a collaborative environment means disagreements may happen. But remember to criticize ideas, not people. Share thoughtful, constructive criticism and be courteous to those you interact with. If you’re unable to engage respectfully, consider taking a step back or using some of our moderation tools to deescalate a tense situation.
+ - **Be empathetic**. - Apify is a global community with people from a wide variety of backgrounds and perspectives, many of which may not be your own. Try to put yourself in others’ shoes and understand their feelings before you address them. Do your best to help make Apify a community where others feel safe to make contributions, participate in discussions, and share different ideas.
- **Contribute in a positive and constructive way**.
- - **Improve the discussion**. Help us make this a great place for discussion by always working to improve the discussion in some way, however small. If you are not sure your post adds to the conversation, think over what you want to say and try again later. The topics discussed here matter to us, and we want you to act as if they matter to you, too. Be respectful of the topics and the people discussing them, even if you disagree with some of what is being said.
- - **Be clear and stay on topic**. The Apify community is for collaboration, sharing ideas, and helping each other get stuff done. Off-topic comments are a distraction (sometimes welcome, but usually not) from getting work done and being productive. Staying on topic helps produce positive and productive discussions.
- This applies to sharing links, as well. Any links shared in Apify community discussions should be shared with the intent of providing relevant and appropriate information. Links should not be posted to simply drive traffic or attention to a site. Links should always be accompanied by a full explanation of the content and purpose of the link. Posting links, especially unsolicited ones, without relevant and valuable context, can come across as advertising or serving even more malicious purposes.
- - **Share mindfully**. When asking others to give you feedback or collaborate on a project, only share valuable and relevant resources to provide context. Don't post links that don't add value to the discussion, and don't post unsolicited links to your own projects or sites on other user's threads.
- Additionally, don't share sensitive information. This includes your own email address. We don't allow the sharing of such information in Apify community, as it can create security and privacy risks for the poster, as well as other users.
- - **Keep it tidy**. Make the effort to put things in the right place, so that we can spend more time discussing and less time cleaning up. So:
- - Don’t start a discussion in the wrong category.
- - Don’t cross-post the same thing in multiple discussions.
- - Don’t post no-content replies.
- - Don't "bump" posts, unless you have new and relevant information to share.
- - Don’t divert a discussion by changing it midstream.
+ - **Improve the discussion**. Help us make this a great place for discussion by always working to improve the discussion in some way, however small. If you are not sure your post adds to the conversation, think over what you want to say and try again later. The topics discussed here matter to us, and we want you to act as if they matter to you, too. Be respectful of the topics and the people discussing them, even if you disagree with some of what is being said.
+ - **Be clear and stay on topic**. The Apify community is for collaboration, sharing ideas, and helping each other get stuff done. Off-topic comments are a distraction (sometimes welcome, but usually not) from getting work done and being productive. Staying on topic helps produce positive and productive discussions.
+ This applies to sharing links, as well. Any links shared in Apify community discussions should be shared with the intent of providing relevant and appropriate information. Links should not be posted to simply drive traffic or attention to a site. Links should always be accompanied by a full explanation of the content and purpose of the link. Posting links, especially unsolicited ones, without relevant and valuable context, can come across as advertising or serving even more malicious purposes.
+ - **Share mindfully**. When asking others to give you feedback or collaborate on a project, only share valuable and relevant resources to provide context. Don't post links that don't add value to the discussion, and don't post unsolicited links to your own projects or sites on other users' threads.
+ Additionally, don't share sensitive information. This includes your own email address. We don't allow the sharing of such information in Apify community, as it can create security and privacy risks for the poster, as well as other users.
+ - **Keep it tidy**. Make the effort to put things in the right place, so that we can spend more time discussing and less time cleaning up. So:
+ - Don’t start a discussion in the wrong category.
+ - Don’t cross-post the same thing in multiple discussions.
+ - Don’t post no-content replies.
+ - Don't "bump" posts, unless you have new and relevant information to share.
+ - Don’t divert a discussion by changing it midstream.
- **Be trustworthy**.
- - **Always be honest**. Don’t knowingly share incorrect information or intentionally mislead other Apify community participants. If you don’t know the answer to someone’s question but still want to help, you can try helping them research or find resources instead. Apify staff will also be active in Apify community, so if you’re unsure of an answer, it’s likely a moderator will be able to help.
+ - **Always be honest**. Don’t knowingly share incorrect information or intentionally mislead other Apify community participants. If you don’t know the answer to someone’s question but still want to help, you can try helping them research or find resources instead. Apify staff will also be active in Apify community, so if you’re unsure of an answer, it’s likely a moderator will be able to help.
## What is not allowed
diff --git a/sources/legal/latest/policies/cookie-policy.md b/sources/legal/latest/policies/cookie-policy.md
index 010844f0fa..b7dffd8a2b 100644
--- a/sources/legal/latest/policies/cookie-policy.md
+++ b/sources/legal/latest/policies/cookie-policy.md
@@ -48,41 +48,41 @@ These cookies may be set through our site by our advertising partners. They may
None of our cookies last forever. You can always choose to delete cookies from your computer at any time. Even if you do not delete them yourself, our cookies are set to expire automatically after some time. Some cookies will be deleted as soon as you close your browser (so-called “session cookies”), some cookies will stay on your device until you delete them or they expire (so-called “persistent cookies”). You can see from the table below the lifespan of each type of cookie that we use; session cookies are those marked with 0 days' expiration, all other cookies are persistent, and you can see the number of days they last before they automatically expire. The expiration periods work on a rolling basis, i.e., each time you visit our website again, the period restarts.
-| Cookie name | Cookie description | Type | Expiration (in days) |
-|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|----------------------|
-| AWSALB | AWS ELB application load balancer | Strictly necessary | 6 |
-| OptanonConsent | This cookie is set by the cookie compliance solution from OneTrust. It stores information about the categories of cookies the site uses and whether visitors have given or withdrawn consent for the use of each category. This enables site owners to prevent cookies in each category from being set in the user's browser, when consent is not given. The cookie has a normal lifespan of one year, so that returning visitors to the site will have their preferences remembered. It contains no information that can identify the site visitor. | Strictly necessary | 364 |
-| AWSALBCORS | This cookie is managed by AWS and is used for load balancing. | Strictly necessary | 6 |
-| ApifyProdUserId | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 |
-| ApifyProdUser | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 |
-| intercom-id-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 270 |
-| intercom-session-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 6 |
-| \_gaexp\_rc | \_ga | Performance | 0 |
-| \_hjTLDTest | When the Hotjar script executes we try to determine the most generic cookie path we should use, instead of the page hostname. This is done so that cookies can be shared across subdomains (where applicable). To determine this, we try to store the \_hjTLDTest cookie for different URL substring alternatives until it fails. After this check, the cookie is removed. | Performance | 0 |
-| \_hjSessionUser\_1441872 | Hotjar cookie that is set when a user first lands on a page with the Hotjar script. It is used to persist the Hotjar User ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID. | Performance | 364 |
-| \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. | Performance | 0 |
-| \_ga | This cookie name is associated with Google Universal Analytics - which is a significant update to Google's more commonly used analytics service. This cookie is used to distinguish unique users by assigning a randomly generated number as a client identifier. It is included in each page request in a site and used to calculate visitor, session and campaign data for the sites analytics reports. By default it is set to expire after 2 years, although this is customisable by website owners. \_ga | Performance | 729 |
-| \_ga\_F50Z86TBGX | \_ga | Performance | 729 |
-| \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 |
-| \_hjFirstSeen | Identifies a new user's first session on a website, indicating whether or not Hotjar's seeing this user for the first time. | Performance | 0 |
-| \_gclxxxx | Google conversion tracking cookie | Performance | 89 |
-| \_hjAbsoluteSessionInProgress | This cookie is used by HotJar to detect the first pageview session of a user. This is a True/False flag set by the cookie. | Performance | 0 |
-| \_\_hssc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 |
-| \_gaexp | Used to determine a user's inclusion in an experiment and the expiry of experiments a user has been included in.\_ga | Performance | 43 |
-| \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. | Performance | 0 |
-| \_gat\_UA-nnnnnnn-nn | This is a pattern type cookie set by Google Analytics, where the pattern element on the name contains the unique identity number of the account or website it relates to. It appears to be a variation of the \_gat cookie which is used to limit the amount of data recorded by Google on high traffic volume websites. | Performance | 0 |
-| \_\_hstc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 389 |
-| \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 |
-| \_hjSession\_1441872 | A cookie that holds the current session data. This ensures that subsequent requests within the session window will be attributed to the same Hotjar session. | Performance | 0 |
-| \_gid | This cookie name is associated with Google Universal Analytics. This appears to be a new cookie and as of Spring 2017 no information is available from Google. It appears to store and update a unique value for each page visited.\_gid | Performance | 0 |
-| \_gat | This cookie name is associated with Google Universal Analytics, according to documentation it is used to throttle the request rate - limiting the collection of data on high traffic sites. It expires after 10 minutes.\_ga | Performance | 0 |
-| \_\_hssrc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 |
-| ApifyAcqRef | This cookie is used by Apify to identify from which website the user came to Apify. | Performance | 364 |
-| ApifyAcqSrc | This cookie is used by Apify to identify from which website the user came to Apify. | Performance | 364 |
-| hubspotutk | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 389 |
-| \_ALGOLIA | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 179 |
-| kvcd | Social Media sharing tracking cookie. | Targeting | 0 |
-| \_gat\_gtag\_xxxxxxxxxxxxxxxxxxxxxxxxxxx | Google Analytics | Targeting | 0 |
-| km\_vs | Social Media sharing tracking cookie. | Targeting | 0 |
+| Cookie name | Cookie description | Type | Expiration (in days) |
+| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------------------- |
+| AWSALB | AWS ELB application load balancer | Strictly necessary | 6 |
+| OptanonConsent | This cookie is set by the cookie compliance solution from OneTrust. It stores information about the categories of cookies the site uses and whether visitors have given or withdrawn consent for the use of each category. This enables site owners to prevent cookies in each category from being set in the user's browser, when consent is not given. The cookie has a normal lifespan of one year, so that returning visitors to the site will have their preferences remembered. It contains no information that can identify the site visitor. | Strictly necessary | 364 |
+| AWSALBCORS | This cookie is managed by AWS and is used for load balancing. | Strictly necessary | 6 |
+| ApifyProdUserId | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 |
+| ApifyProdUser | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 |
+| intercom-id-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 270 |
+| intercom-session-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 6 |
+| \_gaexp_rc | \_ga | Performance | 0 |
+| \_hjTLDTest | When the Hotjar script executes we try to determine the most generic cookie path we should use, instead of the page hostname. This is done so that cookies can be shared across subdomains (where applicable). To determine this, we try to store the \_hjTLDTest cookie for different URL substring alternatives until it fails. After this check, the cookie is removed. | Performance | 0 |
+| \_hjSessionUser_1441872 | Hotjar cookie that is set when a user first lands on a page with the Hotjar script. It is used to persist the Hotjar User ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID. | Performance | 364 |
+| \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. | Performance | 0 |
+| \_ga | This cookie name is associated with Google Universal Analytics - which is a significant update to Google's more commonly used analytics service. This cookie is used to distinguish unique users by assigning a randomly generated number as a client identifier. It is included in each page request in a site and used to calculate visitor, session and campaign data for the sites analytics reports. By default it is set to expire after 2 years, although this is customisable by website owners. \_ga | Performance | 729 |
+| \_ga_F50Z86TBGX | \_ga | Performance | 729 |
+| \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 |
+| \_hjFirstSeen | Identifies a new user's first session on a website, indicating whether or not Hotjar's seeing this user for the first time. | Performance | 0 |
+| \_gclxxxx | Google conversion tracking cookie | Performance | 89 |
+| \_hjAbsoluteSessionInProgress | This cookie is used by HotJar to detect the first pageview session of a user. This is a True/False flag set by the cookie. | Performance | 0 |
+| \_\_hssc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 |
+| \_gaexp | Used to determine a user's inclusion in an experiment and the expiry of experiments a user has been included in. | Performance | 43 |
+| \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. | Performance | 0 |
+| \_gat_UA-nnnnnnn-nn | This is a pattern type cookie set by Google Analytics, where the pattern element on the name contains the unique identity number of the account or website it relates to. It appears to be a variation of the \_gat cookie which is used to limit the amount of data recorded by Google on high traffic volume websites. | Performance | 0 |
+| \_\_hstc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 389 |
+| \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 |
+| \_hjSession_1441872 | A cookie that holds the current session data. This ensures that subsequent requests within the session window will be attributed to the same Hotjar session. | Performance | 0 |
+| \_gid | This cookie name is associated with Google Universal Analytics. This appears to be a new cookie and as of Spring 2017 no information is available from Google. It appears to store and update a unique value for each page visited. | Performance | 0 |
+| \_gat | This cookie name is associated with Google Universal Analytics, according to documentation it is used to throttle the request rate - limiting the collection of data on high traffic sites. It expires after 10 minutes. | Performance | 0 |
+| \_\_hssrc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 |
+| ApifyAcqRef | This cookie is used by Apify to identify from which website the user came to Apify. | Performance | 364 |
+| ApifyAcqSrc | This cookie is used by Apify to identify from which website the user came to Apify. | Performance | 364 |
+| hubspotutk | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 389 |
+| \_ALGOLIA | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 179 |
+| kvcd | Social Media sharing tracking cookie. | Targeting | 0 |
+| \_gat_gtag_xxxxxxxxxxxxxxxxxxxxxxxxxxx | Google Analytics | Targeting | 0 |
+| km_vs | Social Media sharing tracking cookie. | Targeting | 0 |
_\*Please note that the table serves for general information purposes. The information included in it may change over time and the table may be updated from time to time accordingly._
diff --git a/sources/legal/latest/policies/gdpr-information.md b/sources/legal/latest/policies/gdpr-information.md
index 5874ef21e3..5df3d3b8ff 100644
--- a/sources/legal/latest/policies/gdpr-information.md
+++ b/sources/legal/latest/policies/gdpr-information.md
@@ -26,9 +26,9 @@ First and foremost, we process data that is necessary for us to perform our cont
### What are these ‘legitimate interests’?
-* Improving our Website, Platform and Services to help you reach new levels of productivity.
-* Making sure that your data and Apify's systems are safe and secure.
-* Responsible marketing of our product and its features.
+- Improving our Website, Platform and Services to help you reach new levels of productivity.
+- Making sure that your data and Apify's systems are safe and secure.
+- Responsible marketing of our product and its features.
### What rights do you have in connection with your personal data processing?
@@ -39,11 +39,10 @@ First and foremost, we process data that is necessary for us to perform our cont
3. **Right to erasure:** you have the right to have your personal data erased if (i) they are no longer necessary in relation to the purposes for which they were collected or otherwise processed (ii) the processing was unlawful, (iii) you object to the processing and there are no overriding legitimate grounds for processing your personal data, or the law requires erasure, (iv) we are required to erase data under our legal obligation, or (v) you withdrew your consent to the processing of personal data (if processed based on such consent).
4. **Right to restriction of processing:** if you request to obtain restriction of processing, we are only allowed to store personal data, not further process it, with the exceptions set out in the GDPR. You may exercise the right to restriction in the following cases:
-
- * If you contest the accuracy of your personal data; in this case, the restrictions apply for the time necessary for us to verify the accuracy of the personal data.
- * If we process your personal data unlawfully, but instead of erasure you request only restriction of their use.
- * If we no longer need your personal data for the above-mentioned purposes of processing, but you request the data for the establishment, exercise, or defense of legal claims.
- * If you object to processing, the data processing is restricted pending the verification whether our legitimate interest override yours.
+ - If you contest the accuracy of your personal data; in this case, the restrictions apply for the time necessary for us to verify the accuracy of the personal data.
+ - If we process your personal data unlawfully, but instead of erasure you request only restriction of their use.
+ - If we no longer need your personal data for the above-mentioned purposes of processing, but you request the data for the establishment, exercise, or defense of legal claims.
+ - If you object to processing, the data processing is restricted pending verification of whether our legitimate interests override yours.
5. **Right to data portability:** if you wish us to transmit your personal data to another controller, you may exercise your right to data portability, if technically feasible. If the exercise of your right would adversely affect the rights and freedoms of other persons, we will not be able to comply with the request.
diff --git a/sources/legal/latest/policies/privacy-policy.md b/sources/legal/latest/policies/privacy-policy.md
index 5cc0fe95ff..aac2cfacd5 100644
--- a/sources/legal/latest/policies/privacy-policy.md
+++ b/sources/legal/latest/policies/privacy-policy.md
@@ -48,7 +48,6 @@ This Privacy Policy also does not apply to personal data about current and forme
- [Changes to our Privacy Policy](#changes-to-our-privacy-policy)
- [Contact Us](#contact-us)
-
## Personal Data We Collect
### Personal Data You Provide to Us
@@ -195,4 +194,3 @@ Vodičkova 704/36, Nové Město
110 00 Praha 1
Czech Republic
Attn: Apify Legal Team
-
diff --git a/sources/legal/latest/terms/candidate-referral-program-terms-and-conditions.md b/sources/legal/latest/terms/candidate-referral-program-terms-and-conditions.md
index 38e836e0a1..5bc1eb66ae 100644
--- a/sources/legal/latest/terms/candidate-referral-program-terms-and-conditions.md
+++ b/sources/legal/latest/terms/candidate-referral-program-terms-and-conditions.md
@@ -38,7 +38,7 @@ will receive a reward of **CZK 20,000** from Apify for each such Candidate.
If the Candidate is hired in a capacity other than full-time engagement, the reward will be prorated accordingly. If the Candidate transfers from part-time and/or “DPP/DPČ” to full-time engagement, you will not be entitled to any additional reward.
-A person will be considered a Candidate recommended by you only if you send the Candidate’s CV and contact details to the email address jobs[at]apify[dot]com. As it’s very important for Apify to respond promptly and avoid any inconveniences, Apify cannot accept any other method of recommendation. Sending resumes and information directly to jobs[at]apify[dot]com ensures that the entire Apify recruiting team receives the referral and can take care of the Candidate. When submitting the resume, please provide as much supporting information as possible about why Apify should hire the Candidate.
+A person will be considered a Candidate recommended by you only if you send the Candidate’s CV and contact details to the email address jobs[at]apify[dot]com. As it’s very important for Apify to respond promptly and avoid any inconveniences, Apify cannot accept any other method of recommendation. Sending resumes and information directly to jobs[at]apify[dot]com ensures that the entire Apify recruiting team receives the referral and can take care of the Candidate. When submitting the resume, please provide as much supporting information as possible about why Apify should hire the Candidate.
You shall become entitled to the reward after the Candidate’s probationary period successfully passes. Apify will issue a protocol confirming the payout of the reward. Reward payment is based on your signature of the protocol. It is payable by bank transfer to the account specified in the protocol within thirty (30) days from the date of the protocol signature.
diff --git a/sources/legal/latest/terms/challenge-terms-and-conditions.md b/sources/legal/latest/terms/challenge-terms-and-conditions.md
index 5fa84c7d91..0c60c36e5d 100644
--- a/sources/legal/latest/terms/challenge-terms-and-conditions.md
+++ b/sources/legal/latest/terms/challenge-terms-and-conditions.md
@@ -35,7 +35,7 @@ Participation in this Challenge is free and does not require the purchase of any
- Spamming: Promote your new Actors via Apify's Discord, Apify Console messaging, or Actor reviews.
- Low-Quality Submissions: Publish too many low-quality or spammy Actors, notwithstanding the fact that you may have published other Actors that are high-quality.
-1.4. Individuals or entities are not eligible to participate in the Challenge if they fail our KYC/KYB verification, are listed on any sanctions list, or are incorporated, headquartered, or controlled by residents in Russia.
+ 1.4. Individuals or entities are not eligible to participate in the Challenge if they fail our KYC/KYB verification, are listed on any sanctions list, or are incorporated, headquartered, or controlled by residents in Russia.
## 2. Actor Requirements
@@ -46,7 +46,7 @@ Participation in this Challenge is free and does not require the purchase of any
2.3. **Ineligible Actors**. The following types of Actors are not eligible for any rewards and may result in disqualification:
- Actors that use third-party software under a license that prohibits commercial use or redistribution of the resulting Actor.
-- Actors for scraping the following services or websites: YouTube, LinkedIn, Instagram, Facebook, TikTok, X, Apollo.io, Amazon, Google Maps, Google Search, Google Trends. Notwithstanding the foregoing, Actors that perform non-scraping functionality (e.g., AI agents, etc.) may be eligible.
+- Actors for scraping the following services or websites: YouTube, LinkedIn, Instagram, Facebook, TikTok, X, Apollo.io, Amazon, Google Maps, Google Search, Google Trends. Notwithstanding the foregoing, Actors that perform non-scraping functionality (e.g., AI agents, etc.) may be eligible.
- "Rental" or Pay per Result Actors. (Eligible Actors must be Pay per Usage or Pay per Event (or both).)
- Any existing Actors that have been renamed, substantially re-used, or based on a project existing prior to the Challenge start date.
diff --git a/sources/legal/latest/terms/data-processing-addendum.md b/sources/legal/latest/terms/data-processing-addendum.md
index f69d086f7d..403ab41a43 100644
--- a/sources/legal/latest/terms/data-processing-addendum.md
+++ b/sources/legal/latest/terms/data-processing-addendum.md
@@ -16,7 +16,6 @@ Last Updated: January 13, 2025
---
-
If you wish to execute this DPA, continue [here](https://eform.pandadoc.com/?eform=5344745e-5f8e-44eb-bcbd-1a2f45dbd692) and follow instructions in the PandaDoc form.
---
@@ -206,7 +205,6 @@ Apify Privacy Team, privacy[at]apify[dot]com
Activities relevant to the data transferred under these Clauses: Processing necessary to provide the Apify Platform and other Services by Apify to Customer and for any disclosures of Personal Data in accordance with the Agreement.
Role: Processor or Subprocessor, as applicable
-
Data importer:
Name: Customer's name identified in the Agreement
Address: Customer's address as provided in the Agreement
diff --git a/sources/legal/latest/terms/general-terms-and-conditions.md b/sources/legal/latest/terms/general-terms-and-conditions.md
index 1d55700813..1ebc8869cd 100644
--- a/sources/legal/latest/terms/general-terms-and-conditions.md
+++ b/sources/legal/latest/terms/general-terms-and-conditions.md
@@ -23,7 +23,7 @@ Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00
The Terms are the key document governing the relationship between you and us, so please read the whole text of the Terms. For your convenience, we have presented these terms in a short non-binding summary followed by the full legal terms.
| Section | What can you find there? |
-|------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| ---------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [1. Acceptance of these Terms](#1-acceptance-of-these-terms) | These Terms become a binding contract at the moment you sign-up on our Website. |
| [2. Our Services](#2-our-services) | Overview of the Services that we are providing. |
| [3. User Account](#3-user-account) | In order to use our Services you will create a user account. You must use true and accurate information when creating a user account. |
@@ -202,7 +202,7 @@ Unless otherwise provided hereby, any changes and amendments hereto may only be
This is the history of Apify General Terms and Conditions. If you're a new user, the latest Terms apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.
| Version | Effective from | Effective until |
-|------------------------------------------------------------|-----------------|--------------------|
+| ---------------------------------------------------------- | --------------- | ------------------ |
| Latest (this document) | May 14, 2024 | |
| [Oct 2022](../../old/general-terms-and-conditions-2022.md) | October 1, 2022 | June 13, 2024 |
| Older T&Cs available upon request | | September 30, 2022 |
diff --git a/sources/legal/latest/terms/store-publishing-terms-and-conditions.md b/sources/legal/latest/terms/store-publishing-terms-and-conditions.md
index ff32a61164..3c57c26a84 100644
--- a/sources/legal/latest/terms/store-publishing-terms-and-conditions.md
+++ b/sources/legal/latest/terms/store-publishing-terms-and-conditions.md
@@ -64,7 +64,6 @@ We are authorized to unpublish and/or delete such an Actor, in our sole discreti
**7.1.** By publishing your Actor on Apify Store you are allowing us to view the source code of that Actor. We may only access and inspect the source code in limited circumstances where our interference is necessary for legal, compliance or security reasons, for example, when investigating the presence of any Prohibited Activities.
-
## 8. Maintenance of the Actor
**8.1.** By publishing your Actor you agree to use your best effort to maintain it in working condition and make updates to it from time to time as needed, in order to maintain a continuing functionality.
diff --git a/sources/legal/old/general-terms-and-conditions-2022.md b/sources/legal/old/general-terms-and-conditions-2022.md
index a3d8a8412f..11083f930d 100644
--- a/sources/legal/old/general-terms-and-conditions-2022.md
+++ b/sources/legal/old/general-terms-and-conditions-2022.md
@@ -13,7 +13,7 @@ slug: /old/general-terms-and-conditions-october-2022
You are reading terms and conditions that are no longer effective. If you're a new user, the [latest terms](../latest/terms/general-terms-and-conditions.md) apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.
| Version | Effective from | Effective until |
-|-----------------------------------------------------------|-----------------|--------------------|
+| --------------------------------------------------------- | --------------- | ------------------ |
| [Latest](../latest/terms/general-terms-and-conditions.md) | May 13, 2024 | |
| Oct 2022 (This document) | October 1, 2022 | June 12, 2024 |
| Older T&Cs available upon request | | September 30, 2022 |
@@ -27,7 +27,7 @@ Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00
The Terms are the key document governing our relationship between you and us, please read the whole text of the Terms. For your convenience, below is a short summary of each section of the Terms.
| Section | What can you find there? |
-|------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| ---------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| [1. Acceptance of these Terms](#1-acceptance-of-these-terms) | These Terms become a binding contract at the moment you sign-up on our Website. |
| [2. Our Services](#2-our-services) | Overview of the Services that we are providing. |
| [3. User Account](#3-user-account) | In order to use our Services you will create a user account. You must use true and accurate information when creating a user account. |
@@ -112,9 +112,9 @@ Furthermore, during the use of the Website, Platform (or any of its functionalit
- Use them in a manner likely to unreasonably limit usage by our other customers, including but not limited to burdening the server on which the Platform is located by automated requests outside the interface designed for such a purpose;
- Gather, save, enable the transmission to third parties or enable access to the content that is (themselves or their accessibility) contradictory to the generally binding legal regulations effective in the Czech Republic and in any country in which you are a resident where the Website, Platform or Services are used or where detrimental consequences could arise by taking such actions, including but not limited to the content that:
- - interferes with the Copyright, with rights related to Copyright or with other intellectual property rights and/or confidential or any sensitive information;
- - breaches the applicable legal rules relevant to the protection from hatred for a nation, ethnic group, race, religion, class or another group of people or relevant to the limitation of rights and freedoms of its members or invasion of privacy, promotion of violence and animosity, gambling or the sales or usage of drugs;
- - interferes with the rights to the protection of competition law;
+ - interferes with the Copyright, with rights related to Copyright or with other intellectual property rights and/or confidential or any sensitive information;
+ - breaches the applicable legal rules relevant to the protection from hatred for a nation, ethnic group, race, religion, class or another group of people or relevant to the limitation of rights and freedoms of its members or invasion of privacy, promotion of violence and animosity, gambling or the sales or usage of drugs;
+ - interferes with the rights to the protection of competition law;
- Gather, save, enable the transmission to third parties or enable access to the content that is pornographic, humiliating or that refer to pornographic or humiliating materials;
- Gather, save, enable the transmission to third parties or enable access to the contents that make conspicuous resemblance to the contents, services or third-party applications for the purposes of confusing or deceiving Internet users (so-called phishing);
- Gather, save, enable the transmission to third parties or enable access to the contents that harm our good reputation or authorised interests (including hypertext links to the contents that harm our good reputation or authorised interests);
diff --git a/sources/legal/old/store-publishing-terms-and-conditions-2022.md b/sources/legal/old/store-publishing-terms-and-conditions-2022.md
index 434a412589..fc85efee2c 100644
--- a/sources/legal/old/store-publishing-terms-and-conditions-2022.md
+++ b/sources/legal/old/store-publishing-terms-and-conditions-2022.md
@@ -13,7 +13,7 @@ slug: /old/store-publishing-terms-and-conditions-december-2022
You are reading terms and conditions that are no longer effective. If you're a new user, the [latest terms](../latest/terms/store-publishing-terms-and-conditions.md) apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.
| Version | Effective from | Effective until |
-|--------------------------------------------------------------------|------------------|-----------------|
+| ------------------------------------------------------------------ | ---------------- | --------------- |
| [Latest](../latest/terms/store-publishing-terms-and-conditions.md) | May 13, 2024 | |
| December 2022 (This document) | December 1, 2022 | June 12, 2024 |
@@ -31,7 +31,6 @@ Actors (i.e. the serverless cloud programs running on the Platform as defined in
By clicking a button “I agree”, you claim that you are over 18 years old and agree to adhere to these Apify Store Terms, in addition to the [Terms and the terms of personal data protection](../latest/policies/privacy-policy.md). If you act on behalf of a company when accepting these Apify Store Terms, you also hereby declare to be authorized to perform such legal actions on behalf of the company (herein the term “you” shall mean the relevant company).
-
## 3. Actor name, description and price
**3.1.** Each Actor has its own unique name. When you publish an Actor, you agree to assign to it a relevant, non-deceiving name
@@ -58,7 +57,6 @@ Without limitation to clause 5.2 above, we reserve the right to delete, unpublis
By publishing your Actor on the Platform you are allowing us to view the code of that Actor. We may only access and inspect the code in limited circumstances where our interference is necessary for legal, compliance or security reasons, e.g. when investigating presence of any Prohibited Content, suspicion of credentials stealing or account hacking.
-
## 8. Maintenance of the Actor
By publishing your Actor you agree to use your best effort to maintain it in working condition and make updates to it from time to time as needed, in order to maintain a continuing functionality.
@@ -77,7 +75,6 @@ If your Actor does not provide the declared functionality (a “**Faulty Actor**
**11.2.** In addition to responding according to clause 11.1 above, you agree to respond to us, should we contact you regarding your Actor via email marked “urgent” in its subject, within three business days.
-
## 12. Pricing options
When you decide to set your Actor as paid, you may choose one of the two following options for setting its price:
@@ -86,7 +83,6 @@ When you decide to set your Actor as paid, you may choose one of the two followi
**12.2. Price per Result** model which means that each user of your Actor will pay a fee calculated according to the number of results for each run of that Actor; You will set the price as X USD per 1,000 results. In this model the users do not pay for the Platform usage.
-
## 13. Payments to you
**13.1.** If you set your Actor as paid, you will be entitled to receive remuneration calculated as follows:
diff --git a/sources/platform/actors/development/actor_definition/actor_json.md b/sources/platform/actors/development/actor_definition/actor_json.md
index 8023a1fce0..22f89db93e 100644
--- a/sources/platform/actors/development/actor_definition/actor_json.md
+++ b/sources/platform/actors/development/actor_definition/actor_json.md
@@ -64,20 +64,20 @@ Actor `name`, `version`, `buildTag`, and `environmentVariables` are currently on
:::
-| Property | Type | Description |
-| --- | --- | --- |
-| `actorSpecification` | Required | The version of the Actor specification. This property must be set to `1`, which is the only version available. |
-| `name` | Required | The name of the Actor. |
-| `version` | Required | The version of the Actor, specified in the format `[Number].[Number]`, e.g., `0.1`, `0.3`, `1.0`, `1.3`, etc. |
-| `buildTag` | Optional | The tag name to be applied to a successful build of the Actor. If not specified, defaults to `latest`. Refer to the [builds](../builds_and_runs/builds.md) for more information. |
-| `environmentVariables` | Optional | A map of environment variables to be used during local development. These variables will also be applied to the Actor when deployed on the Apify platform. For more details, see the [environment variables](/cli/docs/vars) section of Apify CLI documentation. |
-| `dockerfile` | Optional | The path to the Dockerfile to be used for building the Actor on the platform. If not specified, the system will search for Dockerfiles in the `.actor/Dockerfile` and `Dockerfile` paths, in that order. Refer to the [Dockerfile](./docker.md) section for more information. |
-| `dockerContextDir` | Optional | The path to the directory to be used as the Docker context when building the Actor. The path is relative to the location of the `actor.json` file. This property is useful for monorepos containing multiple Actors. Refer to the [Actor monorepos](../deployment/source_types.md#actor-monorepos) section for more details. |
-| `readme` | Optional | The path to the README file to be used on the platform. If not specified, the system will look for README files in the `.actor/README.md` and `README.md` paths, in that order of preference. Check out [Apify Marketing Playbook to learn how to write a quality README files](https://apify.notion.site/How-to-create-an-Actor-README-759a1614daa54bee834ee39fe4d98bc2) guidance. |
-| `input` | Optional | You can embed your [input schema](./input_schema/index.md) object directly in `actor.json` under the `input` field. You can also provide a path to a custom input schema. If not provided, the input schema at `.actor/INPUT_SCHEMA.json` or `INPUT_SCHEMA.json` is used, in this order of preference. |
-| `changelog` | Optional | The path to the CHANGELOG file displayed in the Information tab of the Actor in Apify Console next to Readme. If not provided, the CHANGELOG at `.actor/CHANGELOG.md` or `CHANGELOG.md` is used, in this order of preference. Your Actor doesn't need to have a CHANGELOG but it is a good practice to keep it updated for published Actors. |
-| `storages.dataset` | Optional | You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. [Read more](/platform/actors/development/actor-definition/dataset-schema) about Actor dataset schemas. |
-| `minMemoryMbytes` | Optional | Specifies the minimum amount of memory in megabytes required by the Actor to run. Requires an _integer_ value. If both `minMemoryMbytes` and `maxMemoryMbytes` are set, then `minMemoryMbytes` must be equal or lower than `maxMemoryMbytes`. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
-| `maxMemoryMbytes` | Optional | Specifies the maximum amount of memory in megabytes required by the Actor to run. It can be used to control the costs of run, especially when developing pay per result Actors. Requires an _integer_ value. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
-| `usesStandbyMode` | Optional | Boolean specifying whether the Actor will have [Standby mode](../programming_interface/actor_standby.md) enabled. |
-| `webServerSchema` | Optional | Defines an OpenAPI v3 schema for the web server running in the Actor. This can be either an embedded object or a path to a JSON schema file. Use this when your Actor starts its own HTTP server and you want to describe its interface. |
+| Property | Type | Description |
+| ---------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `actorSpecification` | Required | The version of the Actor specification. This property must be set to `1`, which is the only version available. |
+| `name` | Required | The name of the Actor. |
+| `version` | Required | The version of the Actor, specified in the format `[Number].[Number]`, e.g., `0.1`, `0.3`, `1.0`, `1.3`, etc. |
+| `buildTag` | Optional | The tag name to be applied to a successful build of the Actor. If not specified, defaults to `latest`. Refer to the [builds](../builds_and_runs/builds.md) for more information. |
+| `environmentVariables` | Optional | A map of environment variables to be used during local development. These variables will also be applied to the Actor when deployed on the Apify platform. For more details, see the [environment variables](/cli/docs/vars) section of Apify CLI documentation. |
+| `dockerfile` | Optional | The path to the Dockerfile to be used for building the Actor on the platform. If not specified, the system will search for Dockerfiles in the `.actor/Dockerfile` and `Dockerfile` paths, in that order. Refer to the [Dockerfile](./docker.md) section for more information. |
+| `dockerContextDir` | Optional | The path to the directory to be used as the Docker context when building the Actor. The path is relative to the location of the `actor.json` file. This property is useful for monorepos containing multiple Actors. Refer to the [Actor monorepos](../deployment/source_types.md#actor-monorepos) section for more details. |
+| `readme` | Optional | The path to the README file to be used on the platform. If not specified, the system will look for README files in the `.actor/README.md` and `README.md` paths, in that order of preference. Check out the [Apify Marketing Playbook](https://apify.notion.site/How-to-create-an-Actor-README-759a1614daa54bee834ee39fe4d98bc2) for guidance on how to write a quality README file. |
+| `input` | Optional | You can embed your [input schema](./input_schema/index.md) object directly in `actor.json` under the `input` field. You can also provide a path to a custom input schema. If not provided, the input schema at `.actor/INPUT_SCHEMA.json` or `INPUT_SCHEMA.json` is used, in this order of preference. |
+| `changelog` | Optional | The path to the CHANGELOG file displayed in the Information tab of the Actor in Apify Console next to Readme. If not provided, the CHANGELOG at `.actor/CHANGELOG.md` or `CHANGELOG.md` is used, in this order of preference. Your Actor doesn't need to have a CHANGELOG but it is a good practice to keep it updated for published Actors. |
+| `storages.dataset` | Optional | You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. [Read more](/platform/actors/development/actor-definition/dataset-schema) about Actor dataset schemas. |
+| `minMemoryMbytes` | Optional | Specifies the minimum amount of memory in megabytes required by the Actor to run. Requires an _integer_ value. If both `minMemoryMbytes` and `maxMemoryMbytes` are set, then `minMemoryMbytes` must be equal to or lower than `maxMemoryMbytes`. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
+| `maxMemoryMbytes` | Optional | Specifies the maximum amount of memory in megabytes required by the Actor to run. It can be used to control the costs of a run, especially when developing pay per result Actors. Requires an _integer_ value. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
+| `usesStandbyMode` | Optional | Boolean specifying whether the Actor will have [Standby mode](../programming_interface/actor_standby.md) enabled. |
+| `webServerSchema` | Optional | Defines an OpenAPI v3 schema for the web server running in the Actor. This can be either an embedded object or a path to a JSON schema file. Use this when your Actor starts its own HTTP server and you want to describe its interface. |
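
Taken together, a minimal `.actor/actor.json` that uses these properties might look like the sketch below. The property names come from the table above; the concrete values (Actor name, version, environment variable, file paths, and memory limits) are illustrative placeholders only:

```json
{
  "actorSpecification": 1,
  "name": "my-actor",
  "version": "0.1",
  "buildTag": "latest",
  "environmentVariables": {
    "MY_ENV_VAR": "example-value"
  },
  "dockerfile": "./Dockerfile",
  "readme": "./README.md",
  "input": "./input_schema.json",
  "changelog": "./CHANGELOG.md",
  "storages": {
    "dataset": "./dataset_schema.json"
  },
  "minMemoryMbytes": 256,
  "maxMemoryMbytes": 4096,
  "usesStandbyMode": false
}
```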
diff --git a/sources/platform/actors/development/actor_definition/dataset_schema/index.md b/sources/platform/actors/development/actor_definition/dataset_schema/index.md
index 6508bbcf86..d620bfc654 100644
--- a/sources/platform/actors/development/actor_definition/dataset_schema/index.md
+++ b/sources/platform/actors/development/actor_definition/dataset_schema/index.md
@@ -35,7 +35,6 @@ await Actor.pushData({
objectField: {},
});
-
// Exit successfully
await Actor.exit();
```
@@ -197,42 +196,42 @@ The dataset schema structure defines the various components and properties that
### DatasetSchema object definition
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `actorSpecification` | integer | true | Specifies the version of dataset schema structure document. Currently only version 1 is available. |
-| `fields` | JSONSchema compatible object | true | Schema of one dataset object. Use JsonSchema Draft 2020–12 or other compatible formats. |
-| `views` | DatasetView object | true | An object with a description of an API and UI views. |
+| Property | Type | Required | Description |
+| -------------------- | ---------------------------- | -------- | ------------------------------------------------------------------------------------------------------------ |
+| `actorSpecification` | integer | true | Specifies the version of the dataset schema structure document. Currently, only version 1 is available. |
+| `fields` | JSONSchema compatible object | true | Schema of one dataset object. Use JsonSchema Draft 2020–12 or other compatible formats. |
+| `views` | DatasetView object | true | An object with a description of an API and UI views. |
### DatasetView object definition
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `title` | string | true | The title is visible in UI in the Output tab and in the API. |
-| `description` | string | false | The description is only available in the API response. |
-| `transformation` | ViewTransformation object | true | The definition of data transformation applied when dataset data is loaded from Dataset API. |
-| `display` | ViewDisplay object | true | The definition of Output tab UI visualization. |
+| Property | Type | Required | Description |
+| ---------------- | ------------------------- | -------- | ------------------------------------------------------------------------------------------------------ |
+| `title` | string | true | The title is visible in UI in the Output tab and in the API. |
+| `description` | string | false | The description is only available in the API response. |
+| `transformation` | ViewTransformation object | true | The definition of data transformation applied when dataset data is loaded from Dataset API. |
+| `display` | ViewDisplay object | true | The definition of Output tab UI visualization. |
### ViewTransformation object definition
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `fields` | string[] | true | Selects fields to be presented in the output. The order of fields matches the order of columns in visualization UI. If a field value is missing, it will be presented as **undefined** in the UI. |
-| `unwind` | string[] | false | Deconstructs nested children into parent object, For example, with `unwind:["foo"]`, the object `{"foo": {"bar": "hello"}}` is transformed into `{"bar": "hello"}`. |
-| `flatten` | string[] | false | Transforms nested object into flat structure. For example, with `flatten:["foo"]` the object `{"foo":{"bar": "hello"}}` is transformed into `{"foo.bar": "hello"}`. |
-| `omit` | string[] | false | Removes the specified fields from the output. Nested fields names can be used as well. |
-| `limit` | integer | false | The maximum number of results returned. Default is all results. |
-| `desc` | boolean | false | By default, results are sorted in ascending based on the write event into the dataset. If `desc:true`, the newest writes to the dataset will be returned first. |
+| Property | Type | Required | Description |
+| --------- | -------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `fields` | string[] | true | Selects fields to be presented in the output. The order of fields matches the order of columns in visualization UI. If a field value is missing, it will be presented as **undefined** in the UI. |
+| `unwind` | string[] | false | Deconstructs nested children into the parent object. For example, with `unwind:["foo"]`, the object `{"foo": {"bar": "hello"}}` is transformed into `{"bar": "hello"}`. |
+| `flatten` | string[] | false | Transforms a nested object into a flat structure. For example, with `flatten:["foo"]` the object `{"foo":{"bar": "hello"}}` is transformed into `{"foo.bar": "hello"}`. |
+| `omit` | string[] | false | Removes the specified fields from the output. Nested field names can be used as well. |
+| `limit` | integer | false | The maximum number of results returned. Default is all results. |
+| `desc` | boolean | false | By default, results are sorted in ascending order based on the write event into the dataset. If `desc:true`, the newest writes to the dataset will be returned first. |
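
As a concrete illustration, the transformation sketch below presents three fields, strips a nested debug field from one of them, and returns at most the 1,000 newest items. The field names (`url`, `title`, `metadata`, `metadata.debug`) are placeholders, not part of the specification:

```json
{
  "fields": ["url", "title", "metadata"],
  "omit": ["metadata.debug"],
  "limit": 1000,
  "desc": true
}
```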
### ViewDisplay object definition
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `component` | string | true | Only the `table` component is available. |
-| `properties` | Object | false | An object with keys matching the `transformation.fields` and `ViewDisplayProperty` as values. If properties are not set, the table will be rendered automatically with fields formatted as `strings`, `arrays` or `objects`. |
+| Property | Type | Required | Description |
+| ------------ | ------ | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `component` | string | true | Only the `table` component is available. |
+| `properties` | Object | false | An object with keys matching the `transformation.fields` and `ViewDisplayProperty` as values. If properties are not set, the table will be rendered automatically with fields formatted as `strings`, `arrays` or `objects`. |
### ViewDisplayProperty object definition
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `label` | string | false | In the Table view, the label will be visible as the table column's header. |
-| `format` | One of
`text`
`number`
`date`
`link`
`boolean`
`image`
`array`
`object`
| false | Describes how output data values are formatted to be rendered in the Output tab UI. |
+| Property | Type | Required | Description |
+| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | ----------------------------------------------------------------------------------- |
+| `label` | string | false | In the Table view, the label will be visible as the table column's header. |
+| `format` | One of `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object` | false | Describes how output data values are formatted to be rendered in the Output tab UI. |
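
Putting the object definitions above together, a complete dataset schema with a single view might look roughly like the following sketch. The overall shape follows the tables above; the view key `overview`, the field names, and the labels are illustrative placeholders:

```json
{
  "actorSpecification": 1,
  "fields": {
    "type": "object",
    "properties": {
      "url": { "type": "string" },
      "title": { "type": "string" }
    },
    "required": ["url"]
  },
  "views": {
    "overview": {
      "title": "Overview",
      "description": "All scraped pages with their titles",
      "transformation": {
        "fields": ["url", "title"]
      },
      "display": {
        "component": "table",
        "properties": {
          "url": { "label": "Page URL", "format": "link" },
          "title": { "label": "Page title", "format": "text" }
        }
      }
    }
  }
}
```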
diff --git a/sources/platform/actors/development/actor_definition/dataset_schema/validation.md b/sources/platform/actors/development/actor_definition/dataset_schema/validation.md
index a4d10edb02..1ade87bde6 100644
--- a/sources/platform/actors/development/actor_definition/dataset_schema/validation.md
+++ b/sources/platform/actors/development/actor_definition/dataset_schema/validation.md
@@ -1,6 +1,6 @@
---
title: Dataset validation
-description: Specify the dataset schema within the Actors so you can add monitoring and validation at the field level.
+description: Specify the dataset schema within the Actors so you can add monitoring and validation at the field level.
slug: /actors/development/actor-definition/dataset-schema/validation
---
@@ -95,10 +95,12 @@ If the data you attempt to store in the dataset is _invalid_ (meaning any of the
"type": "schema-validation-error",
"message": "Schema validation failed",
"data": {
- "invalidItems": [{
- "itemPosition": "",
- "validationErrors": ""
- }]
+ "invalidItems": [
+ {
+ "itemPosition": "",
+ "validationErrors": ""
+ }
+ ]
}
}
}
@@ -200,7 +202,6 @@ In case of enums `null` needs to be within the set of allowed values:
}
```
-
Define type of objects in array:
```json
@@ -245,11 +246,10 @@ When you configure the dataset fields schema, we generate a field list and measu
- **Null count:** how many items in the dataset have the field set to null
- **Empty count:** how many items in the dataset are `undefined`, meaning that, for example, an empty string is not considered empty
- **Minimum and maximum**
- - For numbers, this is calculated directly
- - For strings, this field tracks string length
- - For arrays, this field tracks the number of items in the array
- - For objects, this tracks the number of keys
- - For booleans, this tracks whether the boolean was set to true. Minimum is always 0, but maximum can be either 1 or 0 based on whether at least one item in the dataset has the boolean field set to true.
-
+ - For numbers, this is calculated directly
+ - For strings, this field tracks string length
+ - For arrays, this field tracks the number of items in the array
+ - For objects, this tracks the number of keys
+ - For booleans, this tracks whether the boolean was set to true. Minimum is always 0, but maximum can be either 1 or 0 based on whether at least one item in the dataset has the boolean field set to true.
You can use them in [monitoring](../../../../monitoring#alert-configuration).
diff --git a/sources/platform/actors/development/actor_definition/docker.md b/sources/platform/actors/development/actor_definition/docker.md
index 734540c7a1..7112728fd1 100644
--- a/sources/platform/actors/development/actor_definition/docker.md
+++ b/sources/platform/actors/development/actor_definition/docker.md
@@ -28,14 +28,14 @@ All Apify Docker images are pre-cached on Apify servers to speed up Actor builds
These images come with Node.js (versions `20`, `22`, or `24`), the [Apify SDK for JavaScript](/sdk/js), and [Crawlee](https://crawlee.dev/) preinstalled. The `latest` tag corresponds to the latest LTS version of Node.js.
-| Image | Description |
-| ----- | ----------- |
-| [`actor-node`](https://hub.docker.com/r/apify/actor-node/) | Slim Alpine Linux image with only essential tools. Does not include headless browsers. |
-| [`actor-node-puppeteer-chrome`](https://hub.docker.com/r/apify/actor-node-puppeteer-chrome/) | Debian image with Chromium, Google Chrome, and the [`puppeteer`](https://github.com/puppeteer/puppeteer) library. |
-| [`actor-node-playwright-chrome`](https://hub.docker.com/r/apify/actor-node-playwright-chrome/) | Debian image with Chromium, Google Chrome, and the [`playwright`](https://github.com/microsoft/playwright) library. |
-| [`actor-node-playwright-firefox`](https://hub.docker.com/r/apify/actor-node-playwright-firefox/) | Debian image with Firefox and the [`playwright`](https://github.com/microsoft/playwright) library . |
-| [`actor-node-playwright-webkit`](https://hub.docker.com/r/apify/actor-node-playwright-webkit/) | Ubuntu image with WebKit and the [`playwright`](https://github.com/microsoft/playwright) library. |
-| [`actor-node-playwright`](https://hub.docker.com/r/apify/actor-node-playwright/) | Ubuntu image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers (Chromium, Google Chrome, Firefox, WebKit). |
+| Image | Description |
+| ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| [`actor-node`](https://hub.docker.com/r/apify/actor-node/) | Slim Alpine Linux image with only essential tools. Does not include headless browsers. |
+| [`actor-node-puppeteer-chrome`](https://hub.docker.com/r/apify/actor-node-puppeteer-chrome/) | Debian image with Chromium, Google Chrome, and the [`puppeteer`](https://github.com/puppeteer/puppeteer) library. |
+| [`actor-node-playwright-chrome`](https://hub.docker.com/r/apify/actor-node-playwright-chrome/) | Debian image with Chromium, Google Chrome, and the [`playwright`](https://github.com/microsoft/playwright) library. |
+| [`actor-node-playwright-firefox`](https://hub.docker.com/r/apify/actor-node-playwright-firefox/) | Debian image with Firefox and the [`playwright`](https://github.com/microsoft/playwright) library. |
+| [`actor-node-playwright-webkit`](https://hub.docker.com/r/apify/actor-node-playwright-webkit/) | Ubuntu image with WebKit and the [`playwright`](https://github.com/microsoft/playwright) library. |
+| [`actor-node-playwright`](https://hub.docker.com/r/apify/actor-node-playwright/) | Ubuntu image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers (Chromium, Google Chrome, Firefox, WebKit). |
See the [Docker image guide](/sdk/js/docs/guides/docker-images) for more details.
@@ -43,11 +43,11 @@ See the [Docker image guide](/sdk/js/docs/guides/docker-images) for more details
These images come with Python (version `3.9`, `3.10`, `3.11`, `3.12`, or `3.13`) and the [Apify SDK for Python](/sdk/python) preinstalled. The `latest` tag corresponds to the latest Python 3 version supported by the Apify SDK.
-| Image | Description |
-| ----- | ----------- |
-| [`actor-python`](https://hub.docker.com/r/apify/actor-python) | Slim Debian image with only the Apify SDK for Python. Does not include headless browsers. |
-| [`actor-python-playwright`](https://hub.docker.com/r/apify/actor-python-playwright) | Debian image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers. |
-| [`actor-python-selenium`](https://hub.docker.com/r/apify/actor-python-selenium) | Debian image with [`selenium`](https://github.com/seleniumhq/selenium), Google Chrome, and [ChromeDriver](https://developer.chrome.com/docs/chromedriver/). |
+| Image | Description |
+| ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`actor-python`](https://hub.docker.com/r/apify/actor-python) | Slim Debian image with only the Apify SDK for Python. Does not include headless browsers. |
+| [`actor-python-playwright`](https://hub.docker.com/r/apify/actor-python-playwright) | Debian image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers. |
+| [`actor-python-selenium`](https://hub.docker.com/r/apify/actor-python-selenium) | Debian image with [`selenium`](https://github.com/seleniumhq/selenium), Google Chrome, and [ChromeDriver](https://developer.chrome.com/docs/chromedriver/). |
## Custom Dockerfile
diff --git a/sources/platform/actors/development/actor_definition/input_schema/index.md b/sources/platform/actors/development/actor_definition/input_schema/index.md
index ad7b50024c..6f97b78ba8 100644
--- a/sources/platform/actors/development/actor_definition/input_schema/index.md
+++ b/sources/platform/actors/development/actor_definition/input_schema/index.md
@@ -19,43 +19,47 @@ With an input schema defined as follows:
```json5
{
- "title": "Input schema for Website Content Crawler",
- "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "startUrls": {
- "title": "Start URLs",
- "type": "array",
- "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
- "editor": "requestListSources",
- "prefill": [{ "url": "https://docs.apify.com/" }]
+ title: 'Input schema for Website Content Crawler',
+ description: 'Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.',
+ type: 'object',
+ schemaVersion: 1,
+ properties: {
+ startUrls: {
+ title: 'Start URLs',
+ type: 'array',
+ description: 'One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.',
+ editor: 'requestListSources',
+ prefill: [{ url: 'https://docs.apify.com/' }],
},
- "crawlerType": {
- "sectionCaption": "Crawler settings",
- "title": "Crawler type",
- "type": "string",
- "enum": ["playwright:chrome", "cheerio", "jsdom"],
- "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"],
- "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.",
- "default": "playwright:chrome"
+ crawlerType: {
+ sectionCaption: 'Crawler settings',
+ title: 'Crawler type',
+ type: 'string',
+ enum: ['playwright:chrome', 'cheerio', 'jsdom'],
+ enumTitles: [
+ 'Headless web browser (Chrome+Playwright)',
+ 'Raw HTTP client (Cheerio)',
+ 'Raw HTTP client with JS execution (JSDOM) (experimental!)',
+ ],
+ description: 'Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.',
+ default: 'playwright:chrome',
},
- "maxCrawlDepth": {
- "title": "Max crawling depth",
- "type": "integer",
- "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.",
- "minimum": 0,
- "default": 20
+ maxCrawlDepth: {
+ title: 'Max crawling depth',
+ type: 'integer',
+ description: 'The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.',
+ minimum: 0,
+ default: 20,
},
- "maxCrawlPages": {
- "title": "Max pages",
- "type": "integer",
- "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
- "minimum": 0,
- "default": 9999999
+ maxCrawlPages: {
+ title: 'Max pages',
+ type: 'integer',
+        description: 'The maximum number of pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.',
+ minimum: 0,
+ default: 9999999,
},
// ...
- }
+ },
}
```
diff --git a/sources/platform/actors/development/actor_definition/input_schema/secret_input.md b/sources/platform/actors/development/actor_definition/input_schema/secret_input.md
index bfde32d4c9..98a533b26f 100644
--- a/sources/platform/actors/development/actor_definition/input_schema/secret_input.md
+++ b/sources/platform/actors/development/actor_definition/input_schema/secret_input.md
@@ -15,6 +15,7 @@ The secret input feature lets you mark specific input fields of an Actor as sens
To make an input field secret, you need to add a `"isSecret": true` setting to the input field in the Actor's [input schema](./index.md), like this:
+
```json
{
// ...
@@ -26,9 +27,9 @@ To make an input field secret, you need to add a `"isSecret": true` setting to t
"description": "A secret, encrypted input field",
"editor": "textfield",
"isSecret": true
- },
+ }
// ...
- },
+ }
// ...
}
```
@@ -54,6 +55,7 @@ This feature supports `string`, `object`, and `array` input types. Available edi
When you read the Actor input through `Actor.getInput()`, the encrypted fields are automatically decrypted. Decryption of string fields is supported since [JavaScript SDK](https://docs.apify.com/sdk/js/) 3.1.0; support for objects and arrays was added in [JavaScript SDK](https://docs.apify.com/sdk/js/) 3.4.2 and [Python SDK](https://docs.apify.com/sdk/python/) 2.7.0.
+
```js
> await Actor.getInput();
{
@@ -65,6 +67,7 @@ When you read the Actor input through `Actor.getInput()`, the encrypted fields a
If you read the `INPUT` key from the Actor run's default key-value store directly, you will still get the original, encrypted input value.
+
```js
> await Actor.getValue('INPUT');
{
@@ -81,7 +84,6 @@ The RSA key is unique for each combination of user and Actor, ensuring that no A
During Actor execution, the decryption keys are passed as environment variables, restricting the decryption of secret input fields to occur solely within the context of the Actor run. This approach prevents unauthorized access to sensitive input data outside the Actor's execution environment.
-
## Example Actor
If you want to test the secret input live, check out the [Example Secret Input](https://console.apify.com/actors/O3S2UlSKzkcnFHRRA) Actor in Apify Console.
diff --git a/sources/platform/actors/development/actor_definition/input_schema/specification.md b/sources/platform/actors/development/actor_definition/input_schema/specification.md
index 146a27daa7..1a57dbc7f4 100644
--- a/sources/platform/actors/development/actor_definition/input_schema/specification.md
+++ b/sources/platform/actors/development/actor_definition/input_schema/specification.md
@@ -20,7 +20,6 @@ The Actor input schema file is used to:
- Generate Actor API documentation and integration code examples on the web or in CLI, making Actors easy to integrate for users.
- Simplify integration of Actors into automation workflows such as Zapier or Make, by providing smart connectors that pre-populate and link Actor input properties.
-
To define an input schema for an Actor, set the `input` field in the `.actor/actor.json` file to an input schema object (described below), or a path to a JSON file containing the input schema object.
For backwards compatibility, if the `input` field is omitted, the system looks for an `INPUT_SCHEMA.json` file either in the `.actor` directory or the Actor's top-level directory—but note that this functionality is deprecated and might be removed in the future. The maximum allowed size for the input schema file is 500 kB.
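For example, a minimal `.actor/actor.json` that points to a separate schema file might look like this sketch (the Actor name and file path are illustrative):

```json
{
    "actorSpecification": 1,
    "name": "my-actor",
    "version": "0.1",
    "input": "./input_schema.json"
}
```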
@@ -43,30 +42,27 @@ Imagine a simple web crawler that accepts an array of start URLs and a JavaScrip
```json5
{
- "title": "Cheerio Crawler input",
- "description": "To update crawler to another site, you need to change startUrls and pageFunction options!",
- "type": "object",
- "schemaVersion": 1,
- "properties": {
- "startUrls": {
- "title": "Start URLs",
- "type": "array",
- "description": "URLs to start with",
- "prefill": [
- { "url": "http://example.com" },
- { "url": "http://example.com/some-path" }
- ],
- "editor": "requestListSources"
+ title: 'Cheerio Crawler input',
+ description: 'To update crawler to another site, you need to change startUrls and pageFunction options!',
+ type: 'object',
+ schemaVersion: 1,
+ properties: {
+ startUrls: {
+ title: 'Start URLs',
+ type: 'array',
+ description: 'URLs to start with',
+ prefill: [{ url: 'http://example.com' }, { url: 'http://example.com/some-path' }],
+ editor: 'requestListSources',
+ },
+ pageFunction: {
+ title: 'Page function',
+ type: 'string',
+ description: 'Function executed for each request',
+ prefill: "async () => { return $('title').text(); }",
+ editor: 'javascript',
},
- "pageFunction": {
- "title": "Page function",
- "type": "string",
- "description": "Function executed for each request",
- "prefill": "async () => { return $('title').text(); }",
- "editor": "javascript"
- }
},
- "required": ["startUrls", "pageFunction"]
+ required: ['startUrls', 'pageFunction'],
}
```
@@ -79,12 +75,12 @@ If you switch the input to the **JSON** display using the toggle, then you will
```json
{
"startUrls": [
- {
- "url": "http://example.com"
- },
- {
- "url": "http://example.com/some-path"
- }
+ {
+ "url": "http://example.com"
+ },
+ {
+ "url": "http://example.com/some-path"
+ }
],
"pageFunction": "async () => { return $('title').text(); }"
}
@@ -97,20 +93,22 @@ If you switch the input to the **JSON** display using the toggle, then you will
"title": "Cheerio Crawler input",
"type": "object",
"schemaVersion": 1,
- "properties": { /* define input fields here */ },
+ "properties": {
+ /* define input fields here */
+ },
"required": []
}
```
-| Property | Type | Required | Description |
-| --- | --- | --- | --- |
-| `title` | String | Yes | Any text describing your input schema. |
-| `description` | String | No | Help text for the input that will be displayed above the UI fields. |
-| `type` | String | Yes | This is fixed and must be set to string `object`. |
-| `schemaVersion` | Integer | Yes | The version of the input schema specification against which your schema is written. Currently, only version `1` is out. |
-| `properties` | Object | Yes | This is an object mapping each field key to its specification. |
-| `required` | String | No | An array of field keys that are required. |
-| `additionalProperties` | Boolean | No | Controls if properties not listed in `properties` are allowed. Defaults to `true`. Set to `false` to make requests with extra properties fail. |
+| Property | Type | Required | Description |
+| ---------------------- | ------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `title` | String | Yes | Any text describing your input schema. |
+| `description` | String | No | Help text for the input that will be displayed above the UI fields. |
+| `type` | String | Yes | This is fixed and must be set to string `object`. |
+| `schemaVersion` | Integer | Yes | The version of the input schema specification against which your schema is written. Currently, only version `1` is out. |
+| `properties` | Object | Yes | This is an object mapping each field key to its specification. |
+| `required` | String | No | An array of field keys that are required. |
+| `additionalProperties` | Boolean | No | Controls if properties not listed in `properties` are allowed. Defaults to `true`. Set to `false` to make requests with extra properties fail. |
:::note Input schema differences
@@ -122,16 +120,16 @@ Even though the structure of the Actor input schema is similar to JSON schema, t
Each field of your input is described under its key in the `inputSchema.properties` object. The field might have `integer`, `string`, `array`, `object`, or `boolean` type, and its specification contains the following properties:
-| Property | Value | Required | Description |
-| --- | --- | --- | --- |
-| `type` | One of<br/>`string`<br/>`array`<br/>`object`<br/>`boolean`<br/>`integer` | Yes | Allowed type for the input value. Cannot be mixed. |
-| `title` | String | Yes | Title of the field in UI. |
-| `description` | String | Yes | Description of the field that will be displayed as help text in Actor input UI. |
-| `default` | Must match `type` property. | No | Default value that will be used when no value is provided. |
-| `prefill` | Must match `type` property. | No | Value that will be prefilled in the Actor input interface. |
-| `example` | Must match `type` property. | No | Sample value of this field for the Actor to be displayed when Actor is published in Apify Store. |
-| `sectionCaption` | String | No | If this property is set, then all fields following this field (this field included) will be separated into a collapsible section with the value set as its caption. The section ends at the last field or the next field which has the `sectionCaption` property set. |
-| `sectionDescription` | String | No | If the `sectionCaption` property is set, then you can use this property to provide additional description to the section. The description will be visible right under the caption when the section is open. |
+| Property | Value | Required | Description |
+| -------------------- | ------------------------------------------------------------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `type` | One of<br/>`string`<br/>`array`<br/>`object`<br/>`boolean`<br/>`integer` | Yes | Allowed type for the input value. Cannot be mixed. |
+| `title` | String | Yes | Title of the field in UI. |
+| `description` | String | Yes | Description of the field that will be displayed as help text in Actor input UI. |
+| `default` | Must match `type` property. | No | Default value that will be used when no value is provided. |
+| `prefill` | Must match `type` property. | No | Value that will be prefilled in the Actor input interface. |
+| `example` | Must match `type` property. | No | Sample value of this field for the Actor to be displayed when Actor is published in Apify Store. |
+| `sectionCaption` | String | No | If this property is set, then all fields following this field (this field included) will be separated into a collapsible section with the value set as its caption. The section ends at the last field or the next field which has the `sectionCaption` property set. |
+| `sectionDescription` | String | No | If the `sectionCaption` property is set, then you can use this property to provide additional description to the section. The description will be visible right under the caption when the section is open. |
### Prefill vs. default vs. required
@@ -145,7 +143,6 @@ Here is a rule of thumb for whether an input field should have a `prefill`, `def
In summary, you can use each option independently or use a combination of **Prefill + Required** or **Prefill + Default**, but the combination of **Default + Required** doesn't make sense to use.
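For instance, the **Prefill + Default** combination on a hypothetical `maxPages` integer field might be sketched like this: the UI prefills `100` for the user to adjust, while `1000` is applied when the field is omitted, for example in an API call.

```json
{
    "maxPages": {
        "title": "Max pages",
        "type": "integer",
        "description": "Maximum number of pages to crawl.",
        "prefill": 100,
        "default": 1000
    }
}
```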
-
## Input types
Most types also support additional properties defining, for example, the UI input editor.
@@ -154,17 +151,17 @@ Most types also support additional properties defining, for example, the UI inpu
String is the most common input field type, and provides a number of editors and validation properties:
-| Property | Value | Required | Description |
-|----------|--------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `editor` | One of: - `textfield` - `textarea` - `javascript` - `python` - `select` - `datepicker` - `fileupload` - `hidden` | Yes | Visual editor used for the input field. |
-| `pattern` | String | No | Regular expression that will be used to validate the input. If validation fails, the Actor will not run. |
-| `minLength` | Integer | No | Minimum length of the string. |
-| `maxLength` | Integer | No | Maximum length of the string. |
-| `enum` | [String] | Required if `editor` is `select` | Using this field, you can limit values to the given array of strings. Input will be displayed as select box. |
-| `enumTitles` | [String] | No | Titles for the `enum` keys described. |
-| `nullable` | Boolean | No | Specifies whether `null` is an allowed value. |
-| `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `textfield`, `textarea` and `hidden` editors. |
-| `dateType` | One of<br/>`absolute`<br/>`relative`<br/>`absoluteOrRelative` | No | This property, which is only available with `datepicker` editor, specifies what date format should visual editor accept (The JSON editor accepts any string without validation.).<br/>`absolute` value enables date input in `YYYY-MM-DD` format. To parse returned string regex like this can be used: `^(\d{4})-(0[1-9]\|1[0-2])-(0[1-9]\|[12]\d\|3[01])$`.<br/>`relative` value enables relative date input in `{number} {unit}` format. Supported units are: days, weeks, months, years.<br/>The input is passed to the Actor as plain text (e.g., "3 weeks"). To parse it, regex like this can be used: `^(\d+)\s*(day\|week\|month\|year)s?$`.<br/>`absoluteOrRelative` value enables both absolute and relative formats and user can switch between them. It's up to Actor author to parse a determine actual used format - regexes above can be used to check whether the returned string match one of them.<br/>Defaults to `absolute`. |
+| Property | Value | Required | Description |
+| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `editor` | One of: - `textfield` - `textarea` - `javascript` - `python` - `select` - `datepicker` - `fileupload` - `hidden` | Yes | Visual editor used for the input field. |
+| `pattern` | String | No | Regular expression that will be used to validate the input. If validation fails, the Actor will not run. |
+| `minLength` | Integer | No | Minimum length of the string. |
+| `maxLength` | Integer | No | Maximum length of the string. |
+| `enum` | [String] | Required if `editor` is `select` | Using this field, you can limit values to the given array of strings. Input will be displayed as a select box. |
+| `enumTitles` | [String] | No | Titles for the `enum` keys described. |
+| `nullable` | Boolean | No | Specifies whether `null` is an allowed value. |
+| `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `textfield`, `textarea` and `hidden` editors. |
+| `dateType` | One of<br/>`absolute`<br/>`relative`<br/>`absoluteOrRelative` | No | This property, which is only available with the `datepicker` editor, specifies what date format the visual editor should accept (the JSON editor accepts any string without validation).<br/>The `absolute` value enables date input in `YYYY-MM-DD` format. To parse the returned string, a regex like this can be used: `^(\d{4})-(0[1-9]\|1[0-2])-(0[1-9]\|[12]\d\|3[01])$`.<br/>The `relative` value enables relative date input in `{number} {unit}` format. Supported units are: days, weeks, months, years.<br/>The input is passed to the Actor as plain text (e.g., "3 weeks"). To parse it, a regex like this can be used: `^(\d+)\s*(day\|week\|month\|year)s?$`.<br/>The `absoluteOrRelative` value enables both absolute and relative formats, and the user can switch between them. It's up to the Actor author to parse and determine the actual format used - the regexes above can be used to check which one the returned string matches.<br/>Defaults to `absolute`. |
:::note Regex escape
@@ -193,7 +190,6 @@ The `select` editor is rendered as drop-down in user interface:

-
#### Code editor
If the input string is code, you can use either `javascript` or `python` editor
@@ -215,7 +211,6 @@ Rendered input:

-
#### Date picker
Example of date selection using absolute and relative `datepicker` editor:
@@ -268,7 +263,6 @@ While the `datepicker` editor doesn't support setting time values visually, you
When implementing time-based fields, make sure to explain to your users through the description that the time values should be provided in UTC. This helps prevent timezone-related issues.
-
#### File upload
The `fileupload` editor enables users to specify a file as input. The input is passed to the Actor as a string. It is the Actor author's responsibility to interpret this string, including validating its existence and format.
@@ -281,7 +275,6 @@ The user provides either a URL or uploads the file to a key-value store (existin

-
### Boolean type
Example options with group caption:
@@ -312,12 +305,12 @@ Rendered input:
Properties:
-| Property | Value | Required | Description |
-| --- | --- | --- | --- |
-| `editor` | One of<br/>`checkbox`<br/>`hidden` | No | Visual editor used for the input field. |
-| `groupCaption` | String | No | If you want to group multiple checkboxes together, add this option to the first of the group. |
-| `groupDescription` | String | No | Description displayed as help text displayed of group title. |
-| `nullable` | Boolean | No | Specifies whether null is an allowed value. |
+| Property | Value | Required | Description |
+| ------------------ | ---------------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------ |
+| `editor` | One of<br/>`checkbox`<br/>`hidden` | No | Visual editor used for the input field. |
+| `groupCaption` | String | No | If you want to group multiple checkboxes together, add this option to the first of the group. |
+| `groupDescription` | String | No | Description displayed as help text below the group title. |
+| `nullable` | Boolean | No | Specifies whether null is an allowed value. |
### Numeric types
@@ -346,11 +339,11 @@ Rendered input:
Properties:
| Property | Value | Required | Description |
-|------------|-----------------------------------------------------|----------|-------------------------------------------------------------------------------|
-| `type` | One of<br/>`integer`<br/>`number` | Yes | Defines the type of the field — either an integer or a floating-point number. |
+| ---------- | --------------------------------------------------- | -------- | ----------------------------------------------------------------------------- |
+| `type` | One of<br/>`integer`<br/>`number` | Yes | Defines the type of the field — either an integer or a floating-point number. |
| `editor` | One of:<br/>`number`<br/>`hidden` | No | Visual editor used for input field. |
| `maximum` | Integer or Number (based on the `type`) | No | Maximum allowed value. |
-| `minimum` | Integer or Number (based on the `type`) | No | Minimum allowed value. |
+| `minimum` | Integer or Number (based on the `type`) | No | Minimum allowed value. |
| `unit` | String | No | Unit displayed next to the field in UI, for example _second_, _MB_, etc. |
| `nullable` | Boolean | No | Specifies whether null is an allowed value. |
@@ -412,7 +405,7 @@ Rendered input:
Properties:
| Property | Value | Required | Description |
-|------------------------|----------------------------------------------------------------------------------------|----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| ---------------------- | -------------------------------------------------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `editor` | One of<br/>`json`<br/>`proxy`<br/>`schemaBased`<br/>`hidden` | Yes | UI editor used for input. |
| `patternKey` | String | No | Regular expression that will be used to validate the keys of the object. |
| `patternValue` | String | No | Regular expression that will be used to validate the values of object. |
@@ -619,7 +612,7 @@ Rendered input:
Properties:
| Property | Value | Required | Description |
-|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `editor` | One of<br/>`json`<br/>`requestListSources`<br/>`pseudoUrls`<br/>`globs`<br/>`keyValue`<br/>`stringList`<br/>`fileupload`<br/>`select`<br/>`schemaBased`<br/>`hidden` | Yes | UI editor used for input. |
| `placeholderKey` | String | No | Placeholder displayed for key field when no value is specified. Works only with `keyValue` editor. |
| `placeholderValue` | String | No | Placeholder displayed in value field when no value is provided. Works only with `keyValue` and `stringList` editors. |
@@ -632,7 +625,6 @@ Properties:
| `items` | object | No | Specifies format of the items of the array, useful mainly for multiselect and for `schemaBased` editor (see below). |
| `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `json` and `hidden` editors. |
-
Usage of this field is based on the selected editor:
- `requestListSources` - value from this field can be used as input for the [RequestList](https://crawlee.dev/api/core/class/RequestList) class from Crawlee.
@@ -728,9 +720,7 @@ For example, having an input schema like this:
"type": "array",
"description": "List of HTTP requests",
"editor": "schemaBased",
- "default": [
- { "url": "https://apify.com", "port": 80 }
- ],
+ "default": [{ "url": "https://apify.com", "port": 80 }],
"items": {
"type": "object",
"properties": {
@@ -757,9 +747,7 @@ For example, having an input schema like this:
If there is no value specified for the field, the array will default to containing one object:
```json
-[
- { "url": "https://apify.com", "port": 80 }
-]
+[{ "url": "https://apify.com", "port": 80 }]
```
However, if the user adds a new item to the array, the `port` sub-property of that new object will default to `8080`, as defined in the sub-property itself.
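So if the user then appends a second item and only fills in its URL (the `https://example.com` value is just an illustration), the resulting input would look something like this:

```json
[
    { "url": "https://apify.com", "port": 80 },
    { "url": "https://example.com", "port": 8080 }
]
```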
@@ -883,37 +871,36 @@ Rendered input:
#### Single value properties
-| Property | Value | Required | Description |
-|----------------|-----------------------------------------------------------------------------------|----------|----------------------------------------------------------------------------------------------------------|
-| `type` | `string` | Yes | Specifies the type of input - `string` for single value. |
-| `editor` | One of<br/>`resourcePicker`<br/>`textfield`<br/>`hidden` | No | Visual editor used for the input field. Defaults to `resourcePicker`. |
-| `resourceType` | One of<br/>`dataset`<br/>`keyValueStore`<br/>`requestQueue` | Yes | Type of Apify Platform resource |
-| `resourcePermissions` | Array of strings; allowed values:<br/>`READ`<br/>`WRITE` | Yes | Permissions requested for the referenced resource. Use [\"READ\"] for read-only access, or [\"READ\", \"WRITE\"] to allow writes.|
-| `pattern` | String | No | Regular expression that will be used to validate the input. If validation fails, the Actor will not run. |
-| `minLength` | Integer | No | Minimum length of the string. |
-| `maxLength` | Integer | No | Maximum length of the string. |
+| Property | Value | Required | Description |
+| --------------------- | --------------------------------------------------------------------------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| `type` | `string` | Yes | Specifies the type of input - `string` for single value. |
+| `editor` | One of<br/>`resourcePicker`<br/>`textfield`<br/>`hidden` | No | Visual editor used for the input field. Defaults to `resourcePicker`. |
+| `resourceType` | One of<br/>`dataset`<br/>`keyValueStore`<br/>`requestQueue` | Yes | Type of Apify Platform resource |
+| `resourcePermissions` | Array of strings; allowed values:<br/>`READ`<br/>`WRITE` | Yes | Permissions requested for the referenced resource. Use [\"READ\"] for read-only access, or [\"READ\", \"WRITE\"] to allow writes. |
+| `pattern` | String | No | Regular expression that will be used to validate the input. If validation fails, the Actor will not run. |
+| `minLength` | Integer | No | Minimum length of the string. |
+| `maxLength` | Integer | No | Maximum length of the string. |
#### Multiple values properties
-| Property | Value | Required | Description |
-|----------------|-----------------------------------------------------------------------------------|----------|----------------------------------------------------------------------------|
-| `type` | `array` | Yes | Specifies the type of input - `array` for multiple values. |
-| `editor` | One of<br/>`resourcePicker`<br/>`hidden` | No | Visual editor used for the input field. Defaults to `resourcePicker`. |
-| `resourceType` | One of<br/>`dataset`<br/>`keyValueStore`<br/>`requestQueue` | Yes | Type of Apify Platform resource |
-| `resourcePermissions` | Array of strings; allowed values:<br/>`READ`<br/>`WRITE` | Yes | Permissions requested for the referenced resources. Use [\"READ\"] for read-only access, or [\"READ\", \"WRITE\"] to allow writes. Applies to each selected resource. |
-| `minItems` | Integer | No | Minimum number of items the array can contain. |
-| `maxItems` | Integer | No | Maximum number of items the array can contain. |
+| Property | Value | Required | Description |
+| --------------------- | --------------------------------------------------------------------------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `type` | `array` | Yes | Specifies the type of input - `array` for multiple values. |
+| `editor` | One of<br/>`resourcePicker`<br/>`hidden` | No | Visual editor used for the input field. Defaults to `resourcePicker`. |
+| `resourceType` | One of<br/>`dataset`<br/>`keyValueStore`<br/>`requestQueue` | Yes | Type of Apify Platform resource |
+| `resourcePermissions` | Array of strings; allowed values:<br/>`READ`<br/>`WRITE` | Yes | Permissions requested for the referenced resources. Use [\"READ\"] for read-only access, or [\"READ\", \"WRITE\"] to allow writes. Applies to each selected resource. |
+| `minItems` | Integer | No | Minimum number of items the array can contain. |
+| `maxItems` | Integer | No | Maximum number of items the array can contain. |
#### Resource permissions
If your Actor runs with limited permissions, it must declare what access it needs to resources supplied via input. The `resourcePermissions` field defines which operations your Actor can perform on user-selected storages. This field is evaluated at run start and expands the Actor's [limited permissions](../../permissions/index.md) scope to access resources sent via input.
- `["READ"]` - The Actor can read from the referenced resources.
-- `["READ", "WRITE"]` - The Actor can read from and write to the referenced resources.
+- `["READ", "WRITE"]` - The Actor can read from and write to the referenced resources.
:::note Runtime behavior
This setting defines runtime access only and doesn't change field visibility or whether the field is required in the UI. For array fields (`type: array`), the same permissions apply to each selected resource. Your Actor's run will fail with an insufficient-permissions error if it attempts an operation without the required permission, such as writing with read-only access. Users can see the required permissions in the [input field's tooltip](../../../running/permissions.md#recognizing-permission-levels-in-console-and-store).
-
:::
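As a sketch, a single-value field for a dataset that the Actor needs to both read and write could be declared like this (the `outputDataset` field name and description are illustrative):

```json
"outputDataset": {
    "title": "Output dataset",
    "type": "string",
    "description": "Dataset to which the Actor writes its results.",
    "editor": "resourcePicker",
    "resourceType": "dataset",
    "resourcePermissions": ["READ", "WRITE"]
}
```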
diff --git a/sources/platform/actors/development/actor_definition/key_value_store_schema/index.md b/sources/platform/actors/development/actor_definition/key_value_store_schema/index.md
index c2f8f6006e..80450f5bd2 100644
--- a/sources/platform/actors/development/actor_definition/key_value_store_schema/index.md
+++ b/sources/platform/actors/development/actor_definition/key_value_store_schema/index.md
@@ -24,11 +24,15 @@ await Actor.init();
/**
* Actor code
*/
-await Actor.setValue('document-1', 'my text data', { contentType: 'text/plain' });
+await Actor.setValue('document-1', 'my text data', {
+ contentType: 'text/plain',
+});
// ...
-await Actor.setValue(`image-${imageID}`, imageBuffer, { contentType: 'image/jpeg' });
+await Actor.setValue(`image-${imageID}`, imageBuffer, {
+ contentType: 'image/jpeg',
+});
// Exit successfully
await Actor.exit();
@@ -92,22 +96,22 @@ Example response:
```json
{
- "data": {
- "items": [
- {
- "key": "document-1",
- "size": 254
- },
- {
- "key": "document-2",
- "size": 368
- }
- ],
- "count": 2,
- "limit": 1000,
- "exclusiveStartKey": null,
- "isTruncated": false
- }
+ "data": {
+ "items": [
+ {
+ "key": "document-1",
+ "size": 254
+ },
+ {
+ "key": "document-2",
+ "size": 368
+ }
+ ],
+ "count": 2,
+ "limit": 1000,
+ "exclusiveStartKey": null,
+ "isTruncated": false
+ }
}
```
@@ -143,7 +147,9 @@ You have two choices of how to organize files within the `.actor` folder.
"keyValueStore": {
"actorKeyValueStoreSchemaVersion": 1,
"title": "Key-Value Store Schema",
- "collections": { /* Define your collections here */ }
+ "collections": {
+ /* Define your collections here */
+ }
}
}
}
@@ -167,7 +173,9 @@ You have two choices of how to organize files within the `.actor` folder.
{
"actorKeyValueStoreSchemaVersion": 1,
"title": "Key-Value Store Schema",
- "collections": { /* Define your collections here */ }
+ "collections": {
+ /* Define your collections here */
+ }
}
```
@@ -179,22 +187,22 @@ The key-value store schema defines the collections of keys and their properties.
### Key-value store schema object definition
-| Property | Type | Required | Description |
-|-----------------------------------|-------------------------------|----------|-----------------------------------------------------------------------------------------------------------------|
-| `actorKeyValueStoreSchemaVersion` | integer | true | Specifies the version of key-value store schema structure document. Currently only version 1 is available. |
-| `title` | string | true | Title of the schema |
-| `description` | string | false | Description of the schema |
-| `collections` | Object | true | An object where each key is a collection ID and its value is a collection definition object (see below). |
+| Property | Type | Required | Description |
+| --------------------------------- | ------- | -------- | --------------------------------------------------------------------------------------------------------------- |
+| `actorKeyValueStoreSchemaVersion` | integer | true | Specifies the version of key-value store schema structure document. Currently only version 1 is available. |
+| `title` | string | true | Title of the schema |
+| `description` | string | false | Description of the schema |
+| `collections` | Object | true | An object where each key is a collection ID and its value is a collection definition object (see below). |
### Collection object definition
-| Property | Type | Required | Description |
-|----------------|--------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
-| `title` | string | true | The collection’s title, shown in the run's storage tab and in the storage detail view, where it appears as a tab for filtering records. |
-| `description` | string | false | A description of the collection that appears in UI tooltips. |
-| `key` | string | conditional* | Defines a single specific key that will be part of this collection. |
-| `keyPrefix` | string | conditional* | Defines a prefix for keys that should be included in this collection. |
-| `contentTypes` | string array | false | Allowed content types for records in this collection. Used for validation when storing data. |
-| `jsonSchema` | object | false | For collections with content type `application/json`, you can define a JSON schema to validate structure. Uses JSON Schema Draft 07 format. |
+| Property | Type | Required | Description |
+| -------------- | ------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `title` | string | true | The collection’s title, shown in the run's storage tab and in the storage detail view, where it appears as a tab for filtering records. |
+| `description` | string | false | A description of the collection that appears in UI tooltips. |
+| `key` | string | conditional\* | Defines a single specific key that will be part of this collection. |
+| `keyPrefix` | string | conditional\* | Defines a prefix for keys that should be included in this collection. |
+| `contentTypes` | string array | false | Allowed content types for records in this collection. Used for validation when storing data. |
+| `jsonSchema` | object | false | For collections with content type `application/json`, you can define a JSON schema to validate structure. Uses JSON Schema Draft 07 format. |
\* Either `key` or `keyPrefix` must be specified for each collection, but not both.
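To illustrate, a `collections` object with one prefix-based collection and one single-key collection might be sketched as follows (the collection IDs, keys, and content types are illustrative):

```json
"collections": {
    "documents": {
        "title": "Documents",
        "description": "Plain-text documents produced by the Actor.",
        "keyPrefix": "document-",
        "contentTypes": ["text/plain"]
    },
    "summary": {
        "title": "Summary",
        "key": "summary",
        "contentTypes": ["application/json"]
    }
}
```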
diff --git a/sources/platform/actors/development/actor_definition/output_schema/index.md b/sources/platform/actors/development/actor_definition/output_schema/index.md
index 5b0dff5959..9cce2f9eba 100644
--- a/sources/platform/actors/development/actor_definition/output_schema/index.md
+++ b/sources/platform/actors/development/actor_definition/output_schema/index.md
@@ -29,7 +29,9 @@ You can organize the files using one of these structures:
"output": {
"actorOutputSchemaVersion": 1,
"title": "Output schema of the files scraper",
- "properties": { /* define your outputs here */ }
+ "properties": {
+ /* define your outputs here */
+ }
}
}
```
@@ -50,7 +52,9 @@ You can organize the files using one of these structures:
{
"actorOutputSchemaVersion": 1,
"title": "Output schema of the files scraper",
- "properties": { /* define your outputs here */ }
+ "properties": {
+ /* define your outputs here */
+ }
}
```
@@ -60,35 +64,35 @@ The output schema defines the collections of keys and their properties. It allow
### Output schema object definition
-| Property | Type | Required | Description |
-|-----------------------------------|-------------------------------|----------|-----------------------------------------------------------------------------------------------------------------|
-| `actorOutputSchemaVersion` | integer | true | Specifies the version of output schema structure document. Currently only version 1 is available. |
-| `title` | string | true | Title of the schema |
-| `description` | string | false | Description of the schema |
-| `properties` | Object | true | An object where each key is an output ID and its value is an output object definition (see below). |
+| Property | Type | Required | Description |
+| -------------------------- | ------- | -------- | ------------------------------------------------------------------------------------------------------ |
+| `actorOutputSchemaVersion` | integer | true | Specifies the version of output schema structure document. Currently only version 1 is available. |
+| `title` | string | true | Title of the schema |
+| `description` | string | false | Description of the schema |
+| `properties` | Object | true | An object where each key is an output ID and its value is an output object definition (see below). |
### Property object definition
-| Property | Type | Required | Description |
-|----------------|--------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
-| `title` | string | true | The output's title, shown in the run's output tab if there are multiple outputs and in API as key for the generated output URL. |
-| `description` | string | false | A description of the output. Only used when reading the schema (useful for LLMs) |
-| `template` | string | true | Defines a template which will be translated into output URL. The template can use variables (see below) |
+| Property | Type | Required | Description |
+| ------------- | ------ | -------- | ------------------------------------------------------------------------------------------------------------------------------- |
+| `title` | string | true | The output's title, shown in the run's output tab if there are multiple outputs and in API as key for the generated output URL. |
+| `description` | string | false | A description of the output. Only used when reading the schema (useful for LLMs) |
+| `template` | string | true | Defines a template which will be translated into output URL. The template can use variables (see below) |
### Available template variables
-| Variable | Type | Description |
-|----------------|--------------|--------------|
-| `links` | object | Contains quick links to most commonly used URLs |
-| `links.publicRunUrl` | string | Public run url in format `https://console.apify.com/view/runs/:runId` |
-| `links.consoleRunUrl` | string | Console run url in format `https://console.apify.com/actors/runs/:runId` |
-| `links.apiRunUrl` | string | API run url in format `https://api.apify.com/v2/actor-runs/:runId` |
-| `links.apiDefaultDatasetUrl` | string | API url of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId` |
-| `links.apiDefaultKeyValueStoreUrl` | string | API url of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId` |
-| `run` | object | Contains information about the run same as it is returned from the `GET Run` API endpoint |
-| `run.containerUrl` | string | URL of a webserver running inside the run in format `https://.runs.apify.net/` |
-| `run.defaultDatasetId` | string | ID of the default dataset |
-| `run.defaultKeyValueStoreId` | string | ID of the default key-value store |
+| Variable | Type | Description |
+| ---------------------------------- | ------ | ---------------------------------------------------------------------------------------------------------------- |
+| `links` | object | Contains quick links to most commonly used URLs |
+| `links.publicRunUrl` | string | Public run url in format `https://console.apify.com/view/runs/:runId` |
+| `links.consoleRunUrl` | string | Console run url in format `https://console.apify.com/actors/runs/:runId` |
+| `links.apiRunUrl` | string | API run url in format `https://api.apify.com/v2/actor-runs/:runId` |
+| `links.apiDefaultDatasetUrl` | string | API url of default dataset in format `https://api.apify.com/v2/datasets/:defaultDatasetId` |
+| `links.apiDefaultKeyValueStoreUrl` | string | API url of default key-value store in format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId` |
+| `run` | object | Contains information about the run same as it is returned from the `GET Run` API endpoint |
+| `run.containerUrl` | string | URL of a webserver running inside the run in format `https://.runs.apify.net/` |
+| `run.defaultDatasetId` | string | ID of the default dataset |
+| `run.defaultKeyValueStoreId` | string | ID of the default key-value store |
## Examples
@@ -104,8 +108,16 @@ await Actor.init();
/**
* Store data in default dataset
*/
-await Actor.pushData({ title: 'Some product', url: 'https://example.com/product/1', price: 9.99 });
-await Actor.pushData({ title: 'Another product', url: 'https://example.com/product/2', price: 4.99 });
+await Actor.pushData({
+ title: 'Some product',
+ url: 'https://example.com/product/1',
+ price: 9.99,
+});
+await Actor.pushData({
+ title: 'Another product',
+ url: 'https://example.com/product/2',
+ price: 4.99,
+});
// Exit successfully
await Actor.exit();
@@ -171,8 +183,12 @@ await Actor.init();
/**
* Store data in key-value store
*/
-await Actor.setValue('document-1.txt', 'my text data', { contentType: 'text/plain' });
-await Actor.setValue(`image-1.jpeg`, imageBuffer, { contentType: 'image/jpeg' });
+await Actor.setValue('document-1.txt', 'my text data', {
+ contentType: 'text/plain',
+});
+await Actor.setValue(`image-1.jpeg`, imageBuffer, {
+ contentType: 'image/jpeg',
+});
// Exit successfully
await Actor.exit();
diff --git a/sources/platform/actors/development/actor_definition/source_code.md b/sources/platform/actors/development/actor_definition/source_code.md
index d2edb69c0a..d4aaa63ee4 100644
--- a/sources/platform/actors/development/actor_definition/source_code.md
+++ b/sources/platform/actors/development/actor_definition/source_code.md
@@ -81,7 +81,6 @@ This `Dockerfile` does the following tasks:
By copying the `package.json` and `package-lock.json` files and installing dependencies before the rest of the source code, you can take advantage of Docker's caching mechanism. This approach ensures that dependencies are only reinstalled when the `package.json` or `package-lock.json` files change, significantly reducing build times. Since the installation of dependencies is often the most time-consuming part of the build process, this optimization can lead to substantial performance improvements, especially for larger projects with many dependencies.
-
:::
### `package.json`
diff --git a/sources/platform/actors/development/automated_tests.md b/sources/platform/actors/development/automated_tests.md
index 04dead97fc..dd74deda41 100644
--- a/sources/platform/actors/development/automated_tests.md
+++ b/sources/platform/actors/development/automated_tests.md
@@ -29,9 +29,9 @@ Example of Actor testing tasks
When creating test tasks:
-* Include a test for your Actor's default configuration
-* Set a low `maxItem` value to conserve credits
-* For large data tests, reduce test frequency to conserve credits
+- Include a test for your Actor's default configuration
+- Set a low `maxItem` value to conserve credits
+- For large data tests, reduce test frequency to conserve credits
## Configure the Actor Testing Actor
@@ -49,19 +49,14 @@ await expectAsync(runResult).toHaveStatus('SUCCEEDED');
-
```js
await expectAsync(runResult).withLog((log) => {
// Neither ReferenceError or TypeErrors should ever occur
// in production code – they mean the code is over-optimistic
// The errors must be dealt with gracefully and displayed with a helpful message to the user
- expect(log)
- .withContext(runResult.format('ReferenceError'))
- .not.toContain('ReferenceError');
+ expect(log).withContext(runResult.format('ReferenceError')).not.toContain('ReferenceError');
- expect(log)
- .withContext(runResult.format('TypeError'))
- .not.toContain('TypeError');
+ expect(log).withContext(runResult.format('TypeError')).not.toContain('TypeError');
});
```
@@ -71,9 +66,7 @@ await expectAsync(runResult).withLog((log) => {
```js
await expectAsync(runResult).withStatistics((stats) => {
// In most cases, you want it to be as close to zero as possible
- expect(stats.requestsRetries)
- .withContext(runResult.format('Request retries'))
- .toBeLessThan(3);
+ expect(stats.requestsRetries).withContext(runResult.format('Request retries')).toBeLessThan(3);
// What is the expected run time for the number of items?
expect(stats.crawlerRuntimeMillis)
@@ -88,14 +81,10 @@ await expectAsync(runResult).withStatistics((stats) => {
```js
await expectAsync(runResult).withDataset(({ dataset, info }) => {
// If you're sure, always set this number to be your exact maxItems
- expect(info.cleanItemCount)
- .withContext(runResult.format('Dataset cleanItemCount'))
- .toBe(3); // or toBeGreaterThan(1) or toBeWithinRange(1,3)
+ expect(info.cleanItemCount).withContext(runResult.format('Dataset cleanItemCount')).toBe(3); // or toBeGreaterThan(1) or toBeWithinRange(1,3)
// Make sure the dataset isn't empty
- expect(dataset.items)
- .withContext(runResult.format('Dataset items array'))
- .toBeNonEmptyArray();
+ expect(dataset.items).withContext(runResult.format('Dataset items array')).toBeNonEmptyArray();
const results = dataset.items;
@@ -105,9 +94,7 @@ await expectAsync(runResult).withDataset(({ dataset, info }) => {
.withContext(runResult.format('Direct url'))
.toStartWith('https://www.yelp.com/biz/');
- expect(result.bizId)
- .withContext(runResult.format('Biz ID'))
- .toBeNonEmptyString();
+ expect(result.bizId).withContext(runResult.format('Biz ID')).toBeNonEmptyString();
}
});
```
@@ -116,15 +103,14 @@ await expectAsync(runResult).withDataset(({ dataset, info }) => {
```js
-await expectAsync(runResult).withKeyValueStore(({ contentType }) => {
- // Check for the proper content type of the saved key-value item
- expect(contentType)
- .withContext(runResult.format('KVS contentType'))
- .toBe('image/gif');
-},
-
-// This also checks for existence of the key-value key
-{ keyName: 'apify.com-scroll_lossless-comp' },
+await expectAsync(runResult).withKeyValueStore(
+ ({ contentType }) => {
+ // Check for the proper content type of the saved key-value item
+ expect(contentType).withContext(runResult.format('KVS contentType')).toBe('image/gif');
+ },
+
+ // This also checks for existence of the key-value key
+ { keyName: 'apify.com-scroll_lossless-comp' },
);
```
diff --git a/sources/platform/actors/development/builds_and_runs/index.md b/sources/platform/actors/development/builds_and_runs/index.md
index 5c3c0c71fb..ea9a08edbd 100644
--- a/sources/platform/actors/development/builds_and_runs/index.md
+++ b/sources/platform/actors/development/builds_and_runs/index.md
@@ -71,7 +71,7 @@ flowchart LR
---
| Status | Type | Description |
-|------------|--------------|---------------------------------------------|
+| ---------- | ------------ | ------------------------------------------- |
| READY | initial | Started but not allocated to any worker yet |
| RUNNING | transitional | Executing on a worker machine |
| SUCCEEDED | terminal | Finished successfully |
diff --git a/sources/platform/actors/development/builds_and_runs/state_persistence.md b/sources/platform/actors/development/builds_and_runs/state_persistence.md
index 218b88f387..386fa12a2a 100644
--- a/sources/platform/actors/development/builds_and_runs/state_persistence.md
+++ b/sources/platform/actors/development/builds_and_runs/state_persistence.md
@@ -104,7 +104,7 @@ import { Actor } from 'apify';
await Actor.init();
// ...
-const previousCrawlingState = await Actor.getValue('my-crawling-state') || {};
+const previousCrawlingState = (await Actor.getValue('my-crawling-state')) || {};
// ...
await Actor.exit();
```
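
The snippet above restores previously saved state; below is a minimal sketch of the matching write side, assuming the same `my-crawling-state` key and an in-memory state object:

```js
import { Actor } from 'apify';

await Actor.init();

// Restore state from a previous run (empty object on the first run).
const crawlingState = (await Actor.getValue('my-crawling-state')) || {};

// Persist the state whenever the platform asks for it (roughly every 60 seconds
// and right before a migration), so a resurrected or migrated run can resume.
Actor.on('persistState', async () => {
    await Actor.setValue('my-crawling-state', crawlingState);
});

// ... crawling logic that updates crawlingState ...

await Actor.exit();
```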
diff --git a/sources/platform/actors/development/deployment/continuous_integration.md b/sources/platform/actors/development/deployment/continuous_integration.md
index 9f7d0e3bc7..a3fb0446bc 100644
--- a/sources/platform/actors/development/deployment/continuous_integration.md
+++ b/sources/platform/actors/development/deployment/continuous_integration.md
@@ -13,18 +13,16 @@ import TabItem from '@theme/TabItem';
---
-Automating your Actor development process can save time and reduce errors, especially for projects with multiple Actors or frequent updates. Instead of manually pushing code, building Actors, and running tests, you can automate these steps to run whenever you push code to your repository.
+Automating your Actor development process can save time and reduce errors, especially for projects with multiple Actors or frequent updates. Instead of manually pushing code, building Actors, and running tests, you can automate these steps to run whenever you push code to your repository.
You can automate Actor builds and tests using your Git repository's automated workflows like [GitHub Actions](https://github.com/features/actions) or [Bitbucket Pipelines](https://www.atlassian.com/software/bitbucket/features/pipelines).
-
:::tip Using Bitbucket?
Follow our step-by-step guide to set up continuous integration for your Actors with Bitbucket Pipelines: [Read the Bitbucket CI guide](https://help.apify.com/en/articles/6988586-setting-up-continuous-integration-for-apify-actors-on-bitbucket).
:::
-
Set up continuous integration for your Actors using one of these methods:
- [Trigger builds with a Webhook](#option-1-trigger-builds-with-a-webhook)
@@ -37,15 +35,15 @@ Choose the method that best fits your workflow.
1. Push your Actor to a GitHub repository.
1. Go to your Actor's detail page in Apify Console, click on the API tab in the top right, then select API Endpoints. Copy the **Build Actor** API endpoint URL. The format is as follows:
- ```cURL
- https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=beta&waitForFinish=60
- ```
+ ```cURL
+ https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=beta&waitForFinish=60
+ ```
- :::note API token
+ :::note API token
- Make sure you select the correct API token from the dropdown.
+ Make sure you select the correct API token from the dropdown.
- :::
+ :::
1. In your GitHub repository, go to Settings > Webhooks > Add webhook.
1. Paste the API URL into the Payload URL field and add the webhook.
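
The webhook does nothing more than call this endpoint. For reference, here is a minimal sketch of triggering the same build manually from Node.js; the Actor name is a placeholder and the POST method and response shape are assumptions based on the Build Actor endpoint above:

```js
// Minimal sketch: trigger an Actor build via the Build Actor endpoint.
// YOUR-ACTOR-NAME is a placeholder; adjust version, tag, and waitForFinish as needed.
const url = 'https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds'
    + `?token=${process.env.APIFY_TOKEN}&version=0.0&tag=beta&waitForFinish=60`;

const response = await fetch(url, { method: 'POST' });
const { data: build } = await response.json();
console.log(`Build ${build.id} finished with status ${build.status}`);
```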
@@ -62,35 +60,35 @@ Now your Actor will automatically rebuild on every push to the GitHub repository

1. Add your Apify token to GitHub secrets
- 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret
- 1. Name the secret and paste in your token
+ 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret
+ 1. Name the secret and paste in your token
- 
+ 
1. Add the Build Actor API endpoint URL to GitHub secrets
- 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret
- 1. In Apify Console, go to your Actor's detail page, click the API tab in the top right, and then select API Endpoints. Copy the **Build Actor** API endpoint URL. The format is as follows:
+ 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret
+ 1. In Apify Console, go to your Actor's detail page, click the API tab in the top right, and then select API Endpoints. Copy the **Build Actor** API endpoint URL. The format is as follows:
- :::note API token
+ :::note API token
- Make sure you select the correct API token from the dropdown.
+ Make sure you select the correct API token from the dropdown.
- :::
+ :::
- ```cURL
- https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=latest&waitForFinish=60
- ```
+ ```cURL
+ https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=latest&waitForFinish=60
+ ```
- 1. Name the secret & paste in your API endpoint
+ 1. Name the secret & paste in your API endpoint
- 
+ 
1. Create GitHub Actions workflow files:
- 1. In your repository, create the `.github/workflows` directory
- 1. Add `latest.yml`. If you want, you can also add `beta.yml` to build Actors from the develop branch (or other branches).
+ 1. In your repository, create the `.github/workflows` directory
+ 1. Add `latest.yml`. If you want, you can also add `beta.yml` to build Actors from the develop branch (or other branches).
-
-
+
+
:::note Use your secret names
@@ -101,29 +99,28 @@ Now your Actor will automatically rebuild on every push to the GitHub repository
```yaml
name: Test and build latest version
on:
- push:
- branches:
- - master
- - main
+ push:
+ branches:
+ - master
+ - main
jobs:
- test-and-build:
- runs-on: ubuntu-latest
- steps:
- # Install dependencies and run tests
- - uses: actions/checkout@v2
- - run: npm install && npm run test
- # Build latest version
- - uses: distributhor/workflow-webhook@v1
- env:
- webhook_url: ${{ secrets.BUILD_ACTOR_URL }}
- webhook_secret: ${{ secrets.APIFY_TOKEN }}
-
+ test-and-build:
+ runs-on: ubuntu-latest
+ steps:
+ # Install dependencies and run tests
+ - uses: actions/checkout@v2
+ - run: npm install && npm run test
+ # Build latest version
+ - uses: distributhor/workflow-webhook@v1
+ env:
+ webhook_url: ${{ secrets.BUILD_ACTOR_URL }}
+ webhook_secret: ${{ secrets.APIFY_TOKEN }}
```
With this setup, pushing to the `main` or `master` branch tests the code and builds a new latest version.
-
-
+
+
:::note Use your secret names
@@ -134,28 +131,27 @@ Now your Actor will automatically rebuild on every push to the GitHub repository
```yaml
name: Test and build beta version
on:
- push:
- branches:
- - develop
+ push:
+ branches:
+ - develop
jobs:
- test-and-build:
- runs-on: ubuntu-latest
- steps:
- # Install dependencies and run tests
- - uses: actions/checkout@v2
- - run: npm install && npm run test
- # Build beta version
- - uses: distributhor/workflow-webhook@v1
- env:
- webhook_url: ${{ secrets.BUILD_ACTOR_URL }}
- webhook_secret: ${{ secrets.APIFY_TOKEN }}
-
+ test-and-build:
+ runs-on: ubuntu-latest
+ steps:
+ # Install dependencies and run tests
+ - uses: actions/checkout@v2
+ - run: npm install && npm run test
+ # Build beta version
+ - uses: distributhor/workflow-webhook@v1
+ env:
+ webhook_url: ${{ secrets.BUILD_ACTOR_URL }}
+ webhook_secret: ${{ secrets.APIFY_TOKEN }}
```
With this setup, pushing to the `develop` branch tests the code and builds a new beta version.
-
-
+
+
## Conclusion
diff --git a/sources/platform/actors/development/deployment/index.md b/sources/platform/actors/development/deployment/index.md
index ccbd537a3e..ada09272cb 100644
--- a/sources/platform/actors/development/deployment/index.md
+++ b/sources/platform/actors/development/deployment/index.md
@@ -68,4 +68,3 @@ To deploy using other methods, first create the Actor manually through Apify CLI
You can link your Actor to a Git repository, Gist, or a Zip file.
For more information on alternative source types, check out next chapter.
-
diff --git a/sources/platform/actors/development/deployment/source_types.md b/sources/platform/actors/development/deployment/source_types.md
index f9f57408c7..e018d1d437 100644
--- a/sources/platform/actors/development/deployment/source_types.md
+++ b/sources/platform/actors/development/deployment/source_types.md
@@ -13,9 +13,9 @@ This section explains the various sources types available for Apify Actors and h
- [Web IDE](#web-ide)
- [Git repository](#git-repository)
- - [Private repositories](#private-repositories)
- - [How to configure deployment keys](#how-to-configure-deployment-keys)
- - [Actor monorepos](#actor-monorepos)
+ - [Private repositories](#private-repositories)
+ - [How to configure deployment keys](#how-to-configure-deployment-keys)
+ - [Actor monorepos](#actor-monorepos)
- [Zip file](#zip-file)
- [GitHub Gist](#github-gist)
diff --git a/sources/platform/actors/development/permissions/index.md b/sources/platform/actors/development/permissions/index.md
index c1bed09cdb..13af37810a 100644
--- a/sources/platform/actors/development/permissions/index.md
+++ b/sources/platform/actors/development/permissions/index.md
@@ -13,7 +13,7 @@ Every time a user runs your Actor, it runs under their Apify account. **Actor pe
Your Actors can request two levels of access:
-- **Limited permissions:** Actors with this permission level have restricted access, primarily to their own storages and the data they generate. They cannot access other user data on the Apify platform.
+- **Limited permissions:** Actors with this permission level have restricted access, primarily to their own storages and the data they generate. They cannot access other user data on the Apify platform.
- **Full permissions:** This level grants an Actor access to all of a user's Apify account data.
Most Actors should use limited permissions to request only the specific access they need and reserve full permissions for exceptional cases where the Actor cannot function otherwise.
@@ -26,7 +26,7 @@ Actors with **Full permissions** receive a token that grants complete access to
Actors with **Limited permissions** receive [a restricted scoped token](../../../integrations/programming/api.md#api-tokens-with-limited-permissions). This token only allows the Actor to perform a specific set of actions, which covers the vast majority of common use cases.
- A limited-permission Actor can:
+A limited-permission Actor can:
- Read and write to its default storages.
- Create any additional storage, and write to that storage.
@@ -65,7 +65,6 @@ When possible, design your Actors to use limited permissions and request only th
:::
-
### Accessing user provided storages
By default, limited-permissions Actors can't access user storages. However, they can access storages that users explicitly provide via the Actor input. To do so, use the input schema to add a storage picker and declare exactly which operations your Actor needs.
@@ -80,11 +79,11 @@ Example input schema field (single resource):
```json
{
- "title": "Output dataset",
- "type": "string",
- "editor": "resourcePicker",
- "resourceType": "dataset",
- "resourcePermissions": ["READ", "WRITE"]
+ "title": "Output dataset",
+ "type": "string",
+ "editor": "resourcePicker",
+ "resourceType": "dataset",
+ "resourcePermissions": ["READ", "WRITE"]
}
```
@@ -95,12 +94,12 @@ Selecting multiple resources:
```json
{
- "title": "Source datasets",
- "type": "array",
- "editor": "resourcePicker",
- "resourceType": "dataset",
- "resourcePermissions": ["READ"],
- "minItems": 1
+ "title": "Source datasets",
+ "type": "array",
+ "editor": "resourcePicker",
+ "resourceType": "dataset",
+ "resourcePermissions": ["READ"],
+ "minItems": 1
}
```
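
Inside the Actor, the selected storage arrives through the input as an ID that you can open directly. A minimal sketch, where `outputDatasetId` is a hypothetical field name matching an input schema like the one above:

```js
import { Actor } from 'apify';

await Actor.init();

// `outputDatasetId` is a hypothetical input field defined with the resourcePicker editor.
const { outputDatasetId } = (await Actor.getInput()) ?? {};

// Open the user-provided dataset; the requested READ/WRITE permissions apply to it.
const outputDataset = await Actor.openDataset(outputDatasetId);
await outputDataset.pushData({ example: 'result' });

await Actor.exit();
```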
@@ -119,7 +118,6 @@ Designing your Actors to work under limited permissions is the recommended appro
- Set the permission level in the Actor’s **Settings** in Console to **Full permissions**.
- Be aware of the [UX implications](#end-user-experience) and impact on [Actor Quality score](../../publishing/quality_score.mdx) for full-permission Actors.
-
:::info Need help with Actor permissions?
If you cannot migrate to limited permissions or have a use case that should work under limited permissions but does not, contact support or ask on [the community forum](https://discord.gg/eN73Xdhtqc).
diff --git a/sources/platform/actors/development/permissions/migration_guide.md b/sources/platform/actors/development/permissions/migration_guide.md
index 383b836850..f24c19da62 100644
--- a/sources/platform/actors/development/permissions/migration_guide.md
+++ b/sources/platform/actors/development/permissions/migration_guide.md
@@ -40,7 +40,6 @@ Or just using the API:
POST https://api.apify.com/v2/acts//runs?forcePermissionLevel=LIMITED_PERMISSIONS
```
-
## Common migration paths
Most public Actors can migrate to limited permissions with minor adjustments, if any. The general prerequisite is to **update the Actor to use the latest [Apify SDK](https://docs.apify.com/sdk)**. To assess what needs to change in your Actor, review these areas:
@@ -74,7 +73,7 @@ For example, your Actor allows the user to provide a custom dataset for the Acto
{
"title": "Output",
"type": "string",
- "description": "Select a dataset for the Actor results",
+ "description": "Select a dataset for the Actor results"
}
```
@@ -84,10 +83,10 @@ To support limited permissions, change it to this:
{
"title": "Output",
"type": "string",
- "description": "Select a dataset for the Actor results",
+ "description": "Select a dataset for the Actor results",
"resourceType": "dataset",
"resourcePermissions": ["READ", "WRITE"],
- "editor": "textfield", // If you want to preserve the plain "string" input UI, instead of rich resource picker.
+ "editor": "textfield" // If you want to preserve the plain "string" input UI, instead of rich resource picker.
}
```
@@ -123,7 +122,7 @@ if (process.env.ACTOR_PERMISSION_LEVEL === 'LIMITED_PERMISSIONS') {
// and will allow access in all follow-up runs.
store = await Actor.openKeyValueStore(NEW_CACHE_STORE_NAME);
} else {
- // If the Actor is still running with full permissions and we should use
+ // If the Actor is still running with full permissions and we should use
// the existing store.
store = await Actor.openKeyValueStore(OLD_CACHE_STORE_NAME);
}
diff --git a/sources/platform/actors/development/programming_interface/actor_standby.md b/sources/platform/actors/development/programming_interface/actor_standby.md
index f8cdddb7f1..954ed0466b 100644
--- a/sources/platform/actors/development/programming_interface/actor_standby.md
+++ b/sources/platform/actors/development/programming_interface/actor_standby.md
@@ -83,7 +83,6 @@ You must return a response; otherwise, the Actor run will never be marked as rea
:::
-
See example code below that distinguishes between "normal" and "readiness probe" requests.
diff --git a/sources/platform/actors/development/programming_interface/basic_commands.md b/sources/platform/actors/development/programming_interface/basic_commands.md
index 105009ecf4..0cdc487649 100644
--- a/sources/platform/actors/development/programming_interface/basic_commands.md
+++ b/sources/platform/actors/development/programming_interface/basic_commands.md
@@ -34,7 +34,6 @@ await Actor.exit();
Alternatively, use the `main()` function for environments that don't support top-level awaits. The `main()` function is syntax-sugar for `init()` and `exit()`. It will call `init()` before it executes its callback and `exit()` after the callback resolves.
-
```js
import { Actor } from 'apify';
@@ -230,7 +229,6 @@ async def main():
To exit immediately without calling exit handlers:
-
@@ -273,7 +271,9 @@ import { Actor } from 'apify';
await Actor.init();
// ...
// Actor will finish with 'FAILED' status
-await Actor.exit('Could not finish the crawl, try increasing memory', { exitCode: 1 });
+await Actor.exit('Could not finish the crawl, try increasing memory', {
+ exitCode: 1,
+});
```
diff --git a/sources/platform/actors/development/programming_interface/container_web_server.md b/sources/platform/actors/development/programming_interface/container_web_server.md
index aff8b31dfd..7ef6288787 100644
--- a/sources/platform/actors/development/programming_interface/container_web_server.md
+++ b/sources/platform/actors/development/programming_interface/container_web_server.md
@@ -59,9 +59,11 @@ app.get('/', (req, res) => {
res.send('Hello world from Express app!');
});
-app.listen(port, () => console.log(`Web server is listening
+app.listen(port, () =>
+ console.log(`Web server is listening
and can be accessed at
- ${process.env.ACTOR_WEB_SERVER_URL}!`));
+ ${process.env.ACTOR_WEB_SERVER_URL}!`),
+);
// Let the Actor run for an hour
await new Promise((r) => setTimeout(r, 60 * 60 * 1000));
diff --git a/sources/platform/actors/development/programming_interface/environment_variables.md b/sources/platform/actors/development/programming_interface/environment_variables.md
index b295748a64..1fd8e4dcc8 100644
--- a/sources/platform/actors/development/programming_interface/environment_variables.md
+++ b/sources/platform/actors/development/programming_interface/environment_variables.md
@@ -34,45 +34,44 @@ Apify sets several system environment variables for each Actor run. These variab
Here's a table of key system environment variables:
-| Environment Variable | Description |
-|----------------------|-------------|
-| `ACTOR_ID` | ID of the Actor. |
-| `ACTOR_FULL_NAME` | Full technical name of the Actor, in the format `owner-username/actor-name`. |
-| `ACTOR_RUN_ID` | ID of the Actor run. |
-| `ACTOR_BUILD_ID` | ID of the Actor build used in the run. |
-| `ACTOR_BUILD_NUMBER` | Build number of the Actor build used in the run. |
-| `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned at the time of start of the Actor and doesn't change over time, even if the assigned build tags change. |
-| `ACTOR_TASK_ID` | ID of the Actor task. Empty if Actor is run outside of any task, e.g. directly using the API. |
-| `ACTOR_EVENTS_WEBSOCKET_URL` | Websocket URL where Actor may listen for [events](/platform/actors/development/programming-interface/system-events) from Actor platform. |
-| `ACTOR_DEFAULT_DATASET_ID` | Unique identifier for the default dataset associated with the current Actor run. |
-| `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | Unique identifier for the default key-value store associated with the current Actor run. |
-| `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | Unique identifier for the default request queue associated with the current Actor run. |
-| `ACTOR_INPUT_KEY` | Key of the record in the default key-value store that holds the [Actor input](/platform/actors/running/input-and-output#input). |
-| `ACTOR_MAX_PAID_DATASET_ITEMS` | For paid-per-result Actors, the user-set limit on returned results. Do not exceed this limit. |
-| `ACTOR_MAX_TOTAL_CHARGE_USD` | For pay-per-event Actors, the user-set limit on run cost. Do not exceed this limit. |
-| `ACTOR_RESTART_ON_ERROR` | If **1**, the Actor run will be restarted if it fails. |
-| `APIFY_HEADLESS` | If **1**, web browsers inside the Actor should run in headless mode (no windowing system available). |
-| `APIFY_IS_AT_HOME` | Contains **1** if the Actor is running on Apify servers. |
-| `ACTOR_MEMORY_MBYTES` | Size of memory allocated for the Actor run, in megabytes. Can be used to optimize memory usage or finetuning of low-level external libraries. |
-| `ACTOR_PERMISSION_LEVEL` | [Permission level](../../running/permissions.md) the Actor is run under (`LIMITED_PERMISSIONS` or `FULL_PERMISSIONS`). This determines what resources in the user’s account the Actor can access. |
-| `APIFY_PROXY_PASSWORD` | Password for accessing Apify Proxy services. This password enables the Actor to utilize proxy servers on behalf of the user who initiated the Actor run. |
-| `APIFY_PROXY_PORT` | TCP port number to be used for connecting to the Apify Proxy. |
-| `APIFY_PROXY_STATUS_URL` | URL for retrieving proxy status information. Appending `?format=json` to this URL returns the data in JSON format for programmatic processing. |
-| `ACTOR_STANDBY_URL` | URL for accessing web servers of Actor runs in the [Actor Standby](/platform/actors/development/programming-interface/standby) mode. |
-| `ACTOR_STARTED_AT` | Date when the Actor was started. |
-| `ACTOR_TIMEOUT_AT` | Date when the Actor will time out. |
-| `APIFY_TOKEN` | API token of the user who started the Actor. |
-| `APIFY_USER_ID` | ID of the user who started the Actor. May differ from the Actor owner. |
-| `APIFY_USER_IS_PAYING` | If it is `1`, it means that the user who started the Actor is a paying user. |
-| `ACTOR_WEB_SERVER_PORT` | TCP port for the Actor to start an HTTP server on. This server can be used to receive external messages or expose monitoring and control interfaces. The server also receives messages from the [Actor Standby](/platform/actors/development/programming-interface/standby) mode. |
-| `ACTOR_WEB_SERVER_URL` | Unique public URL for accessing the Actor run web server from the outside world. |
-| `APIFY_API_PUBLIC_BASE_URL` | Public URL of the Apify API. May be used to interact with the platform programmatically. Typically set to `api.apify.com`. |
-| `APIFY_DEDICATED_CPUS` | Number of CPU cores reserved for the Actor, based on allocated memory. |
-| `APIFY_WORKFLOW_KEY` | Identifier used for grouping related runs and API calls together. |
-| `APIFY_META_ORIGIN` | Specifies how an Actor run was started. Possible values are in [Runs and builds](/platform/actors/running/runs-and-builds#origin) documentation. |
-| `APIFY_INPUT_SECRETS_KEY_FILE` | Path to the secret key used to decrypt [Secret inputs](/platform/actors/development/actor-definition/input-schema/secret-input). |
-| `APIFY_INPUT_SECRETS_KEY_PASSPHRASE` | Passphrase for the input secret key specified in `APIFY_INPUT_SECRETS_KEY_FILE`. |
-
+| Environment Variable | Description |
+| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `ACTOR_ID` | ID of the Actor. |
+| `ACTOR_FULL_NAME` | Full technical name of the Actor, in the format `owner-username/actor-name`. |
+| `ACTOR_RUN_ID` | ID of the Actor run. |
+| `ACTOR_BUILD_ID` | ID of the Actor build used in the run. |
+| `ACTOR_BUILD_NUMBER` | Build number of the Actor build used in the run. |
+| `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned at the time of start of the Actor and doesn't change over time, even if the assigned build tags change. |
+| `ACTOR_TASK_ID` | ID of the Actor task. Empty if Actor is run outside of any task, e.g. directly using the API. |
+| `ACTOR_EVENTS_WEBSOCKET_URL` | Websocket URL where Actor may listen for [events](/platform/actors/development/programming-interface/system-events) from Actor platform. |
+| `ACTOR_DEFAULT_DATASET_ID` | Unique identifier for the default dataset associated with the current Actor run. |
+| `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | Unique identifier for the default key-value store associated with the current Actor run. |
+| `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | Unique identifier for the default request queue associated with the current Actor run. |
+| `ACTOR_INPUT_KEY` | Key of the record in the default key-value store that holds the [Actor input](/platform/actors/running/input-and-output#input). |
+| `ACTOR_MAX_PAID_DATASET_ITEMS` | For paid-per-result Actors, the user-set limit on returned results. Do not exceed this limit. |
+| `ACTOR_MAX_TOTAL_CHARGE_USD` | For pay-per-event Actors, the user-set limit on run cost. Do not exceed this limit. |
+| `ACTOR_RESTART_ON_ERROR` | If **1**, the Actor run will be restarted if it fails. |
+| `APIFY_HEADLESS` | If **1**, web browsers inside the Actor should run in headless mode (no windowing system available). |
+| `APIFY_IS_AT_HOME` | Contains **1** if the Actor is running on Apify servers. |
+| `ACTOR_MEMORY_MBYTES` | Size of memory allocated for the Actor run, in megabytes. Can be used to optimize memory usage or finetuning of low-level external libraries. |
+| `ACTOR_PERMISSION_LEVEL` | [Permission level](../../running/permissions.md) the Actor is run under (`LIMITED_PERMISSIONS` or `FULL_PERMISSIONS`). This determines what resources in the user’s account the Actor can access. |
+| `APIFY_PROXY_PASSWORD` | Password for accessing Apify Proxy services. This password enables the Actor to utilize proxy servers on behalf of the user who initiated the Actor run. |
+| `APIFY_PROXY_PORT` | TCP port number to be used for connecting to the Apify Proxy. |
+| `APIFY_PROXY_STATUS_URL` | URL for retrieving proxy status information. Appending `?format=json` to this URL returns the data in JSON format for programmatic processing. |
+| `ACTOR_STANDBY_URL` | URL for accessing web servers of Actor runs in the [Actor Standby](/platform/actors/development/programming-interface/standby) mode. |
+| `ACTOR_STARTED_AT` | Date when the Actor was started. |
+| `ACTOR_TIMEOUT_AT` | Date when the Actor will time out. |
+| `APIFY_TOKEN` | API token of the user who started the Actor. |
+| `APIFY_USER_ID` | ID of the user who started the Actor. May differ from the Actor owner. |
+| `APIFY_USER_IS_PAYING` | If it is `1`, it means that the user who started the Actor is a paying user. |
+| `ACTOR_WEB_SERVER_PORT` | TCP port for the Actor to start an HTTP server on. This server can be used to receive external messages or expose monitoring and control interfaces. The server also receives messages from the [Actor Standby](/platform/actors/development/programming-interface/standby) mode. |
+| `ACTOR_WEB_SERVER_URL` | Unique public URL for accessing the Actor run web server from the outside world. |
+| `APIFY_API_PUBLIC_BASE_URL` | Public URL of the Apify API. May be used to interact with the platform programmatically. Typically set to `api.apify.com`. |
+| `APIFY_DEDICATED_CPUS` | Number of CPU cores reserved for the Actor, based on allocated memory. |
+| `APIFY_WORKFLOW_KEY` | Identifier used for grouping related runs and API calls together. |
+| `APIFY_META_ORIGIN` | Specifies how an Actor run was started. Possible values are in [Runs and builds](/platform/actors/running/runs-and-builds#origin) documentation. |
+| `APIFY_INPUT_SECRETS_KEY_FILE` | Path to the secret key used to decrypt [Secret inputs](/platform/actors/development/actor-definition/input-schema/secret-input). |
+| `APIFY_INPUT_SECRETS_KEY_PASSPHRASE` | Passphrase for the input secret key specified in `APIFY_INPUT_SECRETS_KEY_FILE`. |
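
As a quick illustration, a sketch of reading a couple of these variables from Node.js; the fallback values are assumptions for local runs where the platform does not set them:

```js
// Respect the user-set result limit for paid-per-result Actors.
const maxPaidItems = Number(process.env.ACTOR_MAX_PAID_DATASET_ITEMS) || Infinity;

// Memory allocated to this run, in megabytes (unset when running locally).
const memoryMbytes = Number(process.env.ACTOR_MEMORY_MBYTES) || 0;

console.log(`Allowed items: ${maxPaidItems}, memory: ${memoryMbytes} MB`);
```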
@@ -81,6 +80,7 @@ Here's a table of key system environment variables:
All date-related variables use the UTC timezone and are in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format (e.g., _2022-07-13T14:23:37.281Z_).
:::
+
## Set up environment variables in `actor.json`
@@ -89,13 +89,13 @@ Actor owners can define custom environment variables in `.actor/actor.json`. All
```json
{
- "actorSpecification": 1,
- "name": "dataset-to-mysql",
- "version": "0.1",
- "buildTag": "latest",
- "environmentVariables": {
- "MYSQL_USER": "my_username",
- }
+ "actorSpecification": 1,
+ "name": "dataset-to-mysql",
+ "version": "0.1",
+ "buildTag": "latest",
+ "environmentVariables": {
+ "MYSQL_USER": "my_username"
+ }
}
```
@@ -139,7 +139,7 @@ import { Actor } from 'apify';
await Actor.init();
// get MYSQL_USER
-const mysql_user = process.env.MYSQL_USER
+const mysql_user = process.env.MYSQL_USER;
// print MYSQL_USER to console
console.log(mysql_user);
diff --git a/sources/platform/actors/development/programming_interface/status_messages.md b/sources/platform/actors/development/programming_interface/status_messages.md
index 832b8f9188..38d215c139 100644
--- a/sources/platform/actors/development/programming_interface/status_messages.md
+++ b/sources/platform/actors/development/programming_interface/status_messages.md
@@ -14,16 +14,16 @@ import TabItem from '@theme/TabItem';
Each Actor run has a status, represented by the `status` field. The following table describes the possible values:
-|Status|Type|Description|
-|--- |--- |--- |
-|`READY`|initial|Started but not allocated to any worker yet|
-|`RUNNING`|transitional|Executing on a worker|
-|`SUCCEEDED`|terminal|Finished successfully|
-|`FAILED`|terminal|Run failed|
-|`TIMING-OUT`|transitional|Timing out now|
-|`TIMED-OUT`|terminal|Timed out|
-|`ABORTING`|transitional|Being aborted by user|
-|`ABORTED`|terminal|Aborted by user|
+| Status | Type | Description |
+| ------------ | ------------ | ------------------------------------------- |
+| `READY` | initial | Started but not allocated to any worker yet |
+| `RUNNING` | transitional | Executing on a worker |
+| `SUCCEEDED` | terminal | Finished successfully |
+| `FAILED` | terminal | Run failed |
+| `TIMING-OUT` | transitional | Timing out now |
+| `TIMED-OUT` | terminal | Timed out |
+| `ABORTING` | transitional | Being aborted by user |
+| `ABORTED` | terminal | Aborted by user |
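
The `status` field can also be read programmatically; a minimal sketch using `apify-client`, with a placeholder run ID:

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// 'my-run-id' is a placeholder for an actual run ID.
const run = await client.run('my-run-id').get();
console.log(`Run ${run.id} is ${run.status}`); // e.g. RUNNING or SUCCEEDED
```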
## Status messages
diff --git a/sources/platform/actors/development/programming_interface/system_events.md b/sources/platform/actors/development/programming_interface/system_events.md
index bcb9e8c10c..d9f100a7d9 100644
--- a/sources/platform/actors/development/programming_interface/system_events.md
+++ b/sources/platform/actors/development/programming_interface/system_events.md
@@ -27,13 +27,12 @@ These events help you manage your Actor's behavior and resources effectively.
The following table outlines the system events available:
-
-| Event name | Payload | Description |
-| -------------- | ------- | ----------- |
-| `cpuInfo` | `{ isCpuOverloaded: Boolean }` | Emitted approximately every second, indicating whether the Actor is using maximum available CPU resources. |
-| `migrating` | `{ timeRemainingSecs: Float }` | Signals that the Actor will soon migrate to another worker server on the Apify platform. |
-| `aborting` | N/A | Triggered when a user initiates a graceful abort of an Actor run, allowing time for cleanup. |
-| `persistState` | `{ isMigrating: Boolean }` | Emitted at regular intervals (default: _60 seconds_) to notify Apify SDK components to persist their state. |
+| Event name | Payload | Description |
+| -------------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------- |
+| `cpuInfo` | `{ isCpuOverloaded: Boolean }` | Emitted approximately every second, indicating whether the Actor is using maximum available CPU resources. |
+| `migrating` | `{ timeRemainingSecs: Float }` | Signals that the Actor will soon migrate to another worker server on the Apify platform. |
+| `aborting` | N/A | Triggered when a user initiates a graceful abort of an Actor run, allowing time for cleanup. |
+| `persistState` | `{ isMigrating: Boolean }` | Emitted at regular intervals (default: _60 seconds_) to notify Apify SDK components to persist their state. |
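
To react to these events, a minimal sketch using the SDK's event listeners (the state key and handler bodies are illustrative):

```js
import { Actor } from 'apify';

await Actor.init();

// Throttle work when the platform reports that the CPU is overloaded.
Actor.on('cpuInfo', ({ isCpuOverloaded }) => {
    if (isCpuOverloaded) console.log('CPU overloaded, consider slowing down.');
});

// Persist progress before the run migrates to another server.
Actor.on('migrating', async () => {
    await Actor.setValue('STATE', { progress: 'checkpoint' });
});

// ... the Actor's main work happens here ...

await Actor.exit();
```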
## How system events work
diff --git a/sources/platform/actors/running/actor_standby.md b/sources/platform/actors/running/actor_standby.md
index b2c171b477..88b3e40ea6 100644
--- a/sources/platform/actors/running/actor_standby.md
+++ b/sources/platform/actors/running/actor_standby.md
@@ -44,16 +44,18 @@ You can provide your [API token](../../integrations/programming/api.md#api-token
```
2. Append the token as a query parameter named `token` to the request URL.
-This approach can be useful if you cannot modify the request headers.
+ This approach can be useful if you cannot modify the request headers.
```text
https://rag-web-browser.apify.actor/search?query=apify&token=my_apify_token
```
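
Building on the query-parameter option above, a minimal sketch of the same request from Node.js, reusing the example rag-web-browser URL:

```js
// Append the token as a query parameter when request headers cannot be modified.
const url = `https://rag-web-browser.apify.actor/search?query=apify&token=${process.env.APIFY_TOKEN}`;
const response = await fetch(url);
console.log(await response.text());
```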
:::tip
+
You can use [scoped tokens](/platform/integrations/api#limited-permissions) to send standby requests. This is useful for allowing third-party services to interact with your Actor without granting access to your entire account.
However, [restricting what an Actor can access](/platform/integrations/api#restricted-access-restrict-what-actors-can-access-using-the-scope-of-this-actor) using a scoped token is not supported when running in Standby mode.
+
:::
## Can I still run the Actor in normal mode
diff --git a/sources/platform/actors/running/index.md b/sources/platform/actors/running/index.md
index 88be596a39..10563f5c1a 100644
--- a/sources/platform/actors/running/index.md
+++ b/sources/platform/actors/running/index.md
@@ -47,7 +47,6 @@ Shortly you will see the first results popping up:

-
And you can use the export button at the bottom left to export the data in multiple formats:

@@ -94,7 +93,6 @@ console.dir(items);
-
```python
diff --git a/sources/platform/actors/running/input_and_output.md b/sources/platform/actors/running/input_and_output.md
index 1a9277630c..9218554eb4 100644
--- a/sources/platform/actors/running/input_and_output.md
+++ b/sources/platform/actors/running/input_and_output.md
@@ -33,12 +33,11 @@ As part of the input, you can also specify run options such as [Build](../develo

-| Option | Description |
-|:---|:---|
-| Build | Tag or number of the build to run (e.g. **latest** or **1.2.34**). |
+| Option | Description |
+| :------ | :-------------------------------------------------------------------------- |
+| Build | Tag or number of the build to run (e.g. **latest** or **1.2.34**). |
| Timeout | Timeout for the Actor run in seconds. Zero value means there is no timeout. |
-| Memory | Amount of memory allocated for the Actor run, in megabytes. |
-
+| Memory | Amount of memory allocated for the Actor run, in megabytes. |
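
These run options can also be set programmatically; a minimal sketch using `apify-client`, where the Actor name, input, and option values are placeholders:

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// 'username/actor-name' is a placeholder; the options mirror the table above.
const run = await client.actor('username/actor-name').call(
    { someInputField: 'value' },
    {
        build: 'latest', // Tag or number of the build to run
        timeout: 300, // Timeout in seconds; 0 means no timeout
        memory: 1024, // Memory in megabytes
    },
);
console.log(`Run finished with status: ${run.status}`);
```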
## Output
diff --git a/sources/platform/actors/running/permissions.md b/sources/platform/actors/running/permissions.md
index 7726f77de8..f59d836ca1 100644
--- a/sources/platform/actors/running/permissions.md
+++ b/sources/platform/actors/running/permissions.md
@@ -1,6 +1,6 @@
---
title: Permissions
-description: "Learn how Actor permissions work for running and building Actors: available permission levels, requesting and granting permissions, and security best practices."
+description: 'Learn how Actor permissions work for running and building Actors: available permission levels, requesting and granting permissions, and security best practices.'
sidebar_position: 5
slug: /actors/running/permissions
---
@@ -17,7 +17,6 @@ The approach is similar to mobile platforms (Android, iOS) where each app explic
::::
-
The permissions model follows the principle of least privilege. Actors run only with the access they explicitly request, giving you transparency and control over what the Actor can access in their account.
There are two permission levels:
@@ -25,12 +24,10 @@ There are two permission levels:
- **Limited permissions:** Actors with this permission level have restricted access, primarily to their own storages, the data they generate, and resources they are given explicit access to. They cannot access any other data in your Apify account.
- **Full permissions (default):** Grants the Actor access to all data in your Apify account.
-
This model protects your data and strengthens platform security by clearly showing what level of access each Actor requires.
Actors using **Limited permissions** are safer to run and suit most tasks. Actors that need **full permissions** (for example to perform administrative tasks in your account, manage your datasets or schedules) clearly indicate this in their detail page.
-
## How Actor permissions work
When a user runs an Actor, it receives an Apify API token. Traditionally, this token grants access to the user's entire Apify account via Apify API. Actors with **full permissions** will continue to operate this way.
@@ -50,7 +47,6 @@ A limited-permission Actor can:
This approach ensures the Actor has everything it needs to function while protecting your data from unnecessary exposure.
-
### Recognizing permission levels in Console and Store
When you browse Actors in Apify Console or Store, you’ll notice a small badge next to each Actor showing its permission level. Hover over the badge to see a short explanation of what access that Actor will have when it runs under your account. Here's how they appear in the Console.
diff --git a/sources/platform/actors/running/runs_and_builds.md b/sources/platform/actors/running/runs_and_builds.md
index 9accc07dae..6301f1ab05 100644
--- a/sources/platform/actors/running/runs_and_builds.md
+++ b/sources/platform/actors/running/runs_and_builds.md
@@ -41,16 +41,16 @@ What's happening inside of an Actor is visible on the Actor run log in the Actor
Both **Actor runs** and **builds** have the **Origin** field indicating how the Actor run or build was invoked, respectively. The origin is displayed in Apify Console and available via [API](https://docs.apify.com/api/v2/actor-run-get) in the `meta.origin` field.
-|Name|Origin|
-|:---|:---|
-|`DEVELOPMENT`|Manually from Apify Console in the Development mode (own Actor)|
-|`WEB`|Manually from Apify Console in "normal" mode (someone else's Actor or task)|
-|`API`|From [Apify API](https://docs.apify.com/api)|
-|`CLI`|From [Apify CLI](https://docs.apify.com/cli/)|
-|`SCHEDULER`|Using a schedule|
-|`WEBHOOK`|Using a webhook|
-|`ACTOR`|From another Actor run|
-|`STANDBY`|From [Actor Standby](./standby)|
+| Name | Origin |
+| :------------ | :-------------------------------------------------------------------------- |
+| `DEVELOPMENT` | Manually from Apify Console in the Development mode (own Actor) |
+| `WEB` | Manually from Apify Console in "normal" mode (someone else's Actor or task) |
+| `API` | From [Apify API](https://docs.apify.com/api) |
+| `CLI` | From [Apify CLI](https://docs.apify.com/cli/) |
+| `SCHEDULER` | Using a schedule |
+| `WEBHOOK` | Using a webhook |
+| `ACTOR` | From another Actor run |
+| `STANDBY` | From [Actor Standby](./standby) |
## Lifecycle
@@ -81,16 +81,15 @@ flowchart LR
---
| Status | Type | Description |
-|:-----------|:-------------|:--------------------------------------------|
+| :--------- | :----------- | :------------------------------------------ |
| READY | initial | Started but not allocated to any worker yet |
| RUNNING | transitional | Executing on a worker machine |
| SUCCEEDED | terminal | Finished successfully |
| FAILED | terminal | Run failed |
| TIMING-OUT | transitional | Timing out now |
| TIMED-OUT | terminal | Timed out |
-| ABORTING | transitional | Being aborted by the user |
-| ABORTED | terminal | Aborted by the user |
-
+| ABORTING | transitional | Being aborted by the user |
+| ABORTED | terminal | Aborted by the user |
### Aborting runs
@@ -119,7 +118,7 @@ You can also adjust timeout and memory or change Actor build before the resurrec
1. Abort a broken run
2. Update the Actor's code and build the new version
3. Resurrect the run using the new build
-:::
+ :::
### Data retention
diff --git a/sources/platform/actors/running/store.md b/sources/platform/actors/running/store.md
index 5d2c8c9278..6926ba698d 100644
--- a/sources/platform/actors/running/store.md
+++ b/sources/platform/actors/running/store.md
@@ -20,7 +20,8 @@ Anyone is welcome to [publish Actors](/platform/actors/publishing) in the store,
## Pricing models
-[//]: # (TODO: link platform usage docs)
+[//]: # 'TODO: link platform usage docs'
+
All Actors in [Apify Store](https://apify.com/store) fall into one of the four pricing models:
1. [**Rental**](#rental-actors) - to continue using the Actor after the trial period, you must rent the Actor from the developer and pay a flat monthly fee in addition to the costs associated with the platform usage that the Actor generates.
@@ -40,7 +41,6 @@ Most rental Actors have a _free trial_ period. The length of the trial is displa
After a trial period, a flat monthly _Actor rental_ fee is automatically subtracted from your prepaid platform usage in advance for the following month. Most of this fee goes directly to the developer and is paid on top of the platform usage generated by the Actor. You can read more about our motivation for releasing rental Actors in [this blog post](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) from Apify's CEO Jan Čurn.
-
#### Rental Actors - Frequently Asked Questions
##### Can I run rental Actors via API or the Apify client?
@@ -49,7 +49,8 @@ Yes, when you are renting an Actor, you can run it using either our [API](/api/v
##### Do I pay platform costs for running rental Actors?
-[//]: # (TODO better link for platform usage costs explaining what it is!)
+[//]: # 'TODO better link for platform usage costs explaining what it is!'
+
Yes, you will pay normal [platform usage costs](https://apify.com/pricing) on top of the monthly Actor rental fee. The platform costs work exactly the same way as for free public Actors or your private Actors. You should find estimates of the cost of usage in each individual rental Actor's README ([see an example](https://apify.com/compass/crawler-google-places#how-much-will-it-cost)).
##### Do I need an Apify paid plan to use rental Actors?
@@ -151,7 +152,7 @@ Pay per event Actor pricing model is very similar to the pay per result model. Y
Next to the Actor name, you will see that the Actor is paid per event.
-[//]: # (TODO: also show the screenshot from Apify Store on Web)
+[//]: # 'TODO: also show the screenshot from Apify Store on Web'

diff --git a/sources/platform/actors/running/usage_and_resources.md b/sources/platform/actors/running/usage_and_resources.md
index 49bb245f51..03e5cd921e 100644
--- a/sources/platform/actors/running/usage_and_resources.md
+++ b/sources/platform/actors/running/usage_and_resources.md
@@ -37,15 +37,16 @@ The CPU allocation for an Actor is automatically computed based on the assigned
- For every `4096MB` of memory, the Actor receives one full CPU core
- If the memory allocation is not a multiple of `4096MB`, the CPU core allocation is calculated proportionally
- Examples:
- - `512MB` = 1/8 of a CPU core
- - `1024MB` = 1/4 of a CPU core
- - `8192MB` = 2 CPU cores
+ - `512MB` = 1/8 of a CPU core
+ - `1024MB` = 1/4 of a CPU core
+ - `8192MB` = 2 CPU cores
#### CPU usage spikes

-[//]: # (Is it still relevant though? Does it still get CPU boost?)
+[//]: # 'Is it still relevant though? Does it still get CPU boost?'
+
Sometimes, you see the Actor's CPU use go over 100%. This is not unusual. To help an Actor start up faster, it is allocated a free CPU boost. For example, if an Actor is assigned 1GB (25% of a core), it will temporarily be allowed to use 100% of the core, so it gets started quicker.
### Disk
@@ -54,16 +55,14 @@ The Actor has hard disk space limited by twice the amount of memory. For example
## Requirements
-Actors built with [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can based on the allocated memory. If you double the allocated memory, the run should be twice as fast and consume the same amount of [compute units](#what-is-a-compute-unit) (1 * 1 = 0.5 * 2).
+Actors built with [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can based on the allocated memory. If you double the allocated memory, the run should be twice as fast and consume the same amount of [compute units](#what-is-a-compute-unit) (1 \* 1 = 0.5 \* 2).
A good middle ground is `4096MB`. If you need the results faster, increase the memory (bear in mind the [next point](#maximum-memory), though). You can also try decreasing it to lower the pressure on the target site.
Autoscaling only applies to solutions that run multiple tasks (URLs) for at least 30 seconds. If you need to scrape just one URL or use Actors like [Google Sheets](https://apify.com/lukaskrivka/google-sheets) that do just a single isolated job, we recommend you lower the memory.
-[//]: # (TODO: It's pretty outdated, we now have platform credits in pricing)
-
-[//]: # (If you read that you can scrape 1000 pages of data for 1 CU and you want to scrape approximately 2 million of them monthly, that means you need 2000 CUs monthly and should [subscribe to the Business plan](https://console.apify.com/billing-new#/subscription).)
-
+[//]: # "TODO: It's pretty outdated, we now have platform credits in pricing"
+[//]: # 'If you read that you can scrape 1000 pages of data for 1 CU and you want to scrape approximately 2 million of them monthly, that means you need 2000 CUs monthly and should [subscribe to the Business plan](https://console.apify.com/billing-new#/subscription).'
If the Actor doesn't have this information, or you want to use your own solution, just run your solution like you want to use it long term. Let's say that you want to scrape the data **every hour for the whole month**. You set up a reasonable memory allocation like `4096MB`, and the whole run takes 15 minutes. That should consume 1 CU (4 \* 0.25 = 1). Now, you just need to multiply that by the number of hours in the day and by the number of days in the month, and you get an estimated usage of 720 (1 \* 24 \* 30) [compute units](#what-is-a-compute-unit) monthly.
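
The same arithmetic written out as a quick sketch, assuming the usual definition of a compute unit as 1 GB of memory used for one hour:

```js
// Rough monthly usage estimate mirroring the example above.
const memoryGB = 4096 / 1024; // 4 GB allocated
const runHours = 15 / 60; // each run takes 15 minutes
const cuPerRun = memoryGB * runHours; // 4 * 0.25 = 1 CU per run
const monthlyCU = cuPerRun * 24 * 30; // hourly runs for a month = 720 CUs
console.log({ cuPerRun, monthlyCU });
```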
@@ -108,7 +107,7 @@ To view the usage of an Actor run, navigate to the **Runs** section and check ou

- For a more detailed breakdown, click on the specific run you want to examine and then on the **?** icon next to the **Usage** label.
+For a more detailed breakdown, click on the specific run you want to examine and then on the **?** icon next to the **Usage** label.

diff --git a/sources/platform/collaboration/general-resource-access.md b/sources/platform/collaboration/general-resource-access.md
index 3731f9d5f9..80397908fb 100644
--- a/sources/platform/collaboration/general-resource-access.md
+++ b/sources/platform/collaboration/general-resource-access.md
@@ -20,9 +20,9 @@ This setting affects the following resources:
- Actor runs
- Actor builds
- Storages:
- - Datasets
- - Key-value stores
- - Request queues
+ - Datasets
+ - Key-value stores
+ - Request queues
Access to resources that require explicit access, such as Actors, tasks, or schedules, is not affected by this setting.
@@ -52,12 +52,10 @@ Because this is a new setting, some existing public Actors and integrations migh
:::
-
### Exceptions
Even if your access is set to **Restricted** there are a few built-in exceptions that make collaboration and platform features work seamlessly. These are explained in the sections below.
-
#### Builds of public Actors
Builds of public Actors are always accessible to anyone who can view the Actor — regardless of the Actor owner’s account **General resource access** setting.
@@ -83,7 +81,7 @@ If you’re using a public Actor from the Apify Store, you can choose to automat
- When enabled, your runs of public Actors are automatically visible to the Actor’s creator
- Shared runs include logs, input, and output storages (dataset, key-value store, request queue)
-This sharing works even if your account has **General resource access** set to **Restricted** — the platform applies specific permission checks to ensure the Actor creator can access only the relevant runs.
+This sharing works even if your account has **General resource access** set to **Restricted** — the platform applies specific permission checks to ensure the Actor creator can access only the relevant runs.
You can disable this behavior at any time by turning off the setting in your account.
@@ -95,9 +93,9 @@ This automatic sharing ensures the developer can view all the context they need
- Full access to the run itself (logs, input, status)
- Automatic access to the run’s default storages:
- - Dataset
- - Key-value store
- - Request queue
+ - Dataset
+ - Key-value store
+ - Request queue
The access is granted through explicit, behind-the-scenes permissions (not anonymous or public access), and is limited to just that run and its related storages. No other resources in your account are affected.
@@ -107,7 +105,7 @@ This means you don’t need to manually adjust permissions or share multiple lin
## Per-resource access control
-The account level access control can be changed on individual resources. This can be done by setting the general access level to other than Restricted in the share dialog for a given resource. This way the resource level setting takes precedence over the account setting.
+The account-level access control can be overridden on individual resources. To do so, set the general access level to something other than Restricted in the share dialog for a given resource; the resource-level setting then takes precedence over the account setting.

@@ -117,7 +115,7 @@ You can also set the general access on a resource programmatically using the Api
```js
const datasetClient = apifyClient.dataset(datasetId);
await datasetClient.update({
- generalAccess: STORAGE_GENERAL_ACCESS.ANYONE_WITH_ID_CAN_READ
+ generalAccess: STORAGE_GENERAL_ACCESS.ANYONE_WITH_ID_CAN_READ,
});
```
@@ -139,23 +137,23 @@ The signature can be temporary (set to expire after a specified duration) or per
Only selected _dataset_ and _key-value store_ endpoints support pre-signed URLs.
This allows fine-grained control over what data can be shared without authentication.
-| Resource | Link | Validity | Notes |
-|-----------|-----------------------|------|-------|
-| _Datasets_ | [Dataset items](/api/v2/dataset-items-get) (`/v2/datasets/:datasetId/items`) | Temporary or Permanent | The link provides access to all dataset items. |
-| _Key-value stores_ | [List of keys](/api/v2/key-value-store-keys-get) (`/v2/key-value-stores/:storeId/keys`) | Temporary or Permanent | Returns the list of keys in a store. |
-| _Key-value stores_ | [Single record](/api/v2/key-value-store-record-get) (`/v2/key-value-stores/:storeId/records/:recordKey`) | _Permanent only_ | The public URL for a specific record is always permanent - it stays valid as long as the record exists. |
+| Resource | Link | Validity | Notes |
+| ------------------ | -------------------------------------------------------------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------------------------------------- |
+| _Datasets_ | [Dataset items](/api/v2/dataset-items-get) (`/v2/datasets/:datasetId/items`) | Temporary or Permanent | The link provides access to all dataset items. |
+| _Key-value stores_ | [List of keys](/api/v2/key-value-store-keys-get) (`/v2/key-value-stores/:storeId/keys`) | Temporary or Permanent | Returns the list of keys in a store. |
+| _Key-value stores_ | [Single record](/api/v2/key-value-store-record-get) (`/v2/key-value-stores/:storeId/records/:recordKey`) | _Permanent only_ | The public URL for a specific record is always permanent - it stays valid as long as the record exists. |
:::info Automatically generated signed URLs
When you retrieve dataset or key-value store details using:
-- `GET https://api.apify.com/v2/datasets/:datasetId`
+- `GET https://api.apify.com/v2/datasets/:datasetId`
- `GET https://api.apify.com/v2/key-value-stores/:storeId`
-the API response includes automatically generated fields:
+the API response includes automatically generated fields:
-- `itemsPublicUrl` – a pre-signed URL providing access to dataset items
-- `keysPublicUrl` – a pre-signed URL providing access to key-value store keys
+- `itemsPublicUrl` – a pre-signed URL providing access to dataset items
+- `keysPublicUrl` – a pre-signed URL providing access to key-value store keys
These automatically generated URLs are _valid for 14 days_.
@@ -179,16 +177,16 @@ The link will include a signature _only if the general resource access is set to
##### Dataset items
-1. Click the **Export** button.
-2. In the modal that appears, click **Copy shareable link**.
+1. Click the **Export** button.
+2. In the modal that appears, click **Copy shareable link**.

##### Key-value store records
-1. Open a key-value store.
-2. Navigate to the record you want to share.
-3. In the **Actions** column, click the link icon to copy signed link.
+1. Open a key-value store.
+2. Navigate to the record you want to share.
+3. In the **Actions** column, click the link icon to copy signed link.

@@ -199,12 +197,14 @@ You can generate pre-signed URLs programmatically for datasets and key-value sto
##### Dataset items
```js
-import { ApifyClient } from "apify-client";
+import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const datasetClient = client.dataset('my-dataset-id');
// Creates pre-signed URL for items (expires in 7 days)
-const itemsUrl = await datasetClient.createItemsPublicUrl({ expiresInSecs: 7 * 24 * 3600 });
+const itemsUrl = await datasetClient.createItemsPublicUrl({
+ expiresInSecs: 7 * 24 * 3600,
+});
// Creates permanent pre-signed URL for items
const permanentItemsUrl = await datasetClient.createItemsPublicUrl();
@@ -216,7 +216,9 @@ const permanentItemsUrl = await datasetClient.createItemsPublicUrl();
const storeClient = client.keyValueStore('my-store-id');
// Create pre-signed URL for list of keys (expires in 1 day)
-const keysPublicUrl = await storeClient.createKeysPublicUrl({ expiresInSecs: 24 * 3600 });
+const keysPublicUrl = await storeClient.createKeysPublicUrl({
+ expiresInSecs: 24 * 3600,
+});
// Create permanent pre-signed URL for list of keys
const permanentKeysPublicUrl = await storeClient.createKeysPublicUrl();
@@ -245,7 +247,7 @@ Manual signing uses standard _HMAC (SHA-256)_ with `urlSigningSecretKey` of the
### Sharing storages by name
-A convenient feature of storages is that you can name them. If you choose to do so there is an extra access level setting that applies to storages only, which is **Anyone with name or ID can read**. In that case anyone that knows the storage name is able to read it via API or view it using the storages Console URL.
+A convenient feature of storages is that you can name them. If you choose to do so, there is an extra access level setting that applies to storages only, which is **Anyone with name or ID can read**. In that case, anyone who knows the storage name is able to read it via API or view it using the storage's Console URL.
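For illustration, a minimal sketch of reading such a named dataset over the API, assuming its access level is set to **Anyone with name or ID can read** (the `username~my-named-dataset` identifier is a placeholder):

```js
// Named storages can be addressed as `username~storage-name` instead of an ID.
const res = await fetch(
  'https://api.apify.com/v2/datasets/username~my-named-dataset/items',
);
const items = await res.json();
console.log(`Read ${items.length} items without an API token`);
```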
:::tip Exposing public named datasets
@@ -259,7 +261,6 @@ If you own a public Actor in the Apify Store, you need to make sure that your Ac
In practice, this means that all API calls originating from the Actor need to have a valid API token. If you are using Apify SDK, this should be the default behavior. See the detailed guide below for more information.
-
:::caution Actor runs inherit user permissions
Keep in mind that when users run your public Actor, the Actor makes API calls under the user account, not your developer account. This means that it follows the _General resource access_ configuration of the user account. The configuration of your developer account has no effect on the Actor users.
@@ -280,9 +281,9 @@ When using the [Apify SDK](https://docs.apify.com/sdk/js/) or [Apify Client](htt
If your Actor makes direct API calls, include the API token manually:
```js
- const response = await fetch(`https://api.apify.com/v2/key-value-stores/${storeId}`, {
+const response = await fetch(`https://api.apify.com/v2/key-value-stores/${storeId}`, {
headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
- });
+});
```
#### Generate pre-signed URLs for external sharing
@@ -292,7 +293,7 @@ If your Actor outputs or shares links to storages (such as datasets or key-value
For example:
```js
-import { ApifyClient } from "apify-client";
+import { ApifyClient } from 'apify-client';
// ❌ Avoid hardcoding raw API URLs
const recordUrl = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${recordKey}`;
@@ -307,7 +308,6 @@ await Actor.pushData({ recordUrl });
To learn more about generating pre-signed URLs, refer to the section [Sharing restricted resources with pre-signed URLs](/platform/collaboration/general-resource-access#pre-signed-urls).
-
:::note Using Console URLs
Datasets and key-value stores also include a `consoleUrl` property.
@@ -327,4 +327,3 @@ You can easily test this by switching your own account’s setting to _Restricte
Once you’ve enabled restricted access, run your Actor and confirm that all links generated in logs, datasets, key-value stores, and status messages remain accessible as expected. Make sure any shared URLs — especially those stored in results or notifications — work without requiring an API token.
:::
-
diff --git a/sources/platform/collaboration/index.md b/sources/platform/collaboration/index.md
index 18aad02362..664f266e2f 100644
--- a/sources/platform/collaboration/index.md
+++ b/sources/platform/collaboration/index.md
@@ -9,16 +9,17 @@ slug: /collaboration
**Learn how to collaborate with other users and manage permissions for organizations or private resources such as Actors, Actor runs, and storages.**
---
+
Apify was built from the ground up as a collaborative platform. Whether you’re publishing your Actor in Apify Store or sharing a dataset with a teammate, collaboration is deeply integrated into how Apify works. You can share your resources (like Actors, runs, or storages) with others, manage permissions, or invite collaborators to your organization. By default, each system resource you create is only available to you, the owner. However, you can grant access to other users, making it easy to collaborate effectively and securely.
While most resources can be shared by assigning permissions (see [Access Rights](./access_rights.md)), some resources can also be shared simply by using their unique links or IDs. There are two types of resources in terms of sharing:
- _Resources that require explicit access by default:_
- - [Actors](../actors/running/index.md), [tasks](../actors/running/tasks.md)
- - Can be shared only by inviting collaborators using [Access Rights](./access_rights.md)) or using [Organization Accounts](./organization_account/index.md)
+ - [Actors](../actors/running/index.md), [tasks](../actors/running/tasks.md)
+ - Can be shared only by inviting collaborators using [Access Rights](./access_rights.md) or using [Organization Accounts](./organization_account/index.md)
- _Resources supporting both explicit access and link sharing:_
- - Actor runs, Actor builds and storage resources (datasets, key-value stores, request queues)
- - Can be shared by inviting collaborators or simply by sharing a unique direct link
+ - Actor runs, Actor builds and storage resources (datasets, key-value stores, request queues)
+ - Can be shared by inviting collaborators or simply by sharing a unique direct link
You can control access to your resources in four ways:
diff --git a/sources/platform/collaboration/list_of_permissions.md b/sources/platform/collaboration/list_of_permissions.md
index 9b7d3b227e..9d414c220f 100644
--- a/sources/platform/collaboration/list_of_permissions.md
+++ b/sources/platform/collaboration/list_of_permissions.md
@@ -18,7 +18,7 @@ To learn about Apify Actors, check out the [documentation](../actors/index.mdx).
### Actor
| Permission | Description |
-|----------------------|------------------------------------------------------------|
+| -------------------- | ---------------------------------------------------------- |
| Read | View Actor settings, source code and builds. |
| Write | Edit Actor settings and source code, and delete the Actor. |
| Run | Run any of an Actor's builds. |
@@ -28,7 +28,7 @@ To learn about Apify Actors, check out the [documentation](../actors/index.mdx).
### Actor task
| Permission | Description |
-|----------------------|------------------------------------------------------------|
+| -------------------- | ---------------------------------------------------------- |
| Read | View task configuration. |
| Write | Edit task configuration and settings, and delete the task. |
| View runs | View a list of Actor task runs and their details. |
@@ -43,7 +43,7 @@ For more information about Storage, see its [documentation](../storage/index.md)
### Dataset
| Permission | Description |
-|----------------------|-----------------------------------------------------------------|
+| -------------------- | --------------------------------------------------------------- |
| Read | View dataset information and its data. |
| Write | Edit dataset settings, push data to it, and remove the dataset. |
| Manage access rights | Manage dataset access rights. |
@@ -53,7 +53,7 @@ To learn about dataset storage, see its [documentation](../storage/dataset.md).
### Key-value-store
| Permission | Description |
-|----------------------|---------------------------------------------------------------------------------------------------|
+| -------------------- | ------------------------------------------------------------------------------------------------- |
| Read | View key-value store details and records. |
| Write | Edit key-value store settings, add, update or remove its records, and delete the key-value store. |
| Manage access rights | Manage key-value store access rights. |
@@ -63,7 +63,7 @@ To learn about key-value stores, see the [documentation](../storage/key_value_st
### Request queue
| Permission | Description |
-|----------------------|------------------------------------------------------------------------------------------------|
+| -------------------- | ---------------------------------------------------------------------------------------------- |
| Read | View request queue details and records. |
| Write | Edit request queue settings, add, update, or remove its records, and delete the request queue. |
| Manage access rights | Manage request queue access rights. |
@@ -73,7 +73,7 @@ To learn about request queue storage, see the [documentation](../storage/request
## Proxy
| Permission | Description |
-|------------|---------------------------|
+| ---------- | ------------------------- |
| Proxy | Allow to use Apify Proxy. |
To learn about Apify Proxy, see its [documentation](../proxy/index.md).
@@ -83,7 +83,7 @@ To learn about Apify Proxy, see its [documentation](../proxy/index.md).
Permissions that can be granted to members of organizations. To learn about the organization account, see its [documentation](./organization_account/index.md).
| Permission | Description |
-|---------------------|-----------------------------------------------------------------------|
+| ------------------- | --------------------------------------------------------------------- |
| Manage access keys | Manage account access keys, i.e. API token and proxy password. |
| Update subscription | Update the type of subscription, billing details and payment methods. |
| Update profile | Make changes in profile information. |
diff --git a/sources/platform/collaboration/organization_account/index.md b/sources/platform/collaboration/organization_account/index.md
index 32842512ec..6898981d0e 100644
--- a/sources/platform/collaboration/organization_account/index.md
+++ b/sources/platform/collaboration/organization_account/index.md
@@ -15,8 +15,8 @@ You can [switch](./how_to_use.md) between your personal and organization account
You can set up an organization in two ways.
-* [Create a new organization](#create-a-new-organization). If you don't have integrations set up yet, or if they are easy to change, you can create a new organization, preserving your personal account.
-* [Convert an existing account](#convert-an-existing-account) into an organization. If your Actors and [integrations](../../integrations/index.mdx) are set up in a personal account, it is probably best to convert that account into an organization. This will preserve all your integrations but means you will have a new personal account created for you.
+- [Create a new organization](#create-a-new-organization). If you don't have integrations set up yet, or if they are easy to change, you can create a new organization, preserving your personal account.
+- [Convert an existing account](#convert-an-existing-account) into an organization. If your Actors and [integrations](../../integrations/index.mdx) are set up in a personal account, it is probably best to convert that account into an organization. This will preserve all your integrations but means you will have a new personal account created for you.
> Prefer video to reading? [See our video tutorial](https://www.youtube.com/watch?v=BIL6HqtnvKk) for organization accounts.
@@ -36,9 +36,9 @@ You can create a new organization by clicking the **Create new organization** bu
> **When you convert an existing user account into an organization,**
>
-> * **You will no longer be able to sign in to the converted user account.**
-> * **An organization cannot be converted back to a personal account.**
-> * **During conversion, a new account (with the same login credentials) will be created for you. You can then use that account to [set up](./setup.md) the organization.**
+> - **You will no longer be able to sign in to the converted user account.**
+> - **An organization cannot be converted back to a personal account.**
+> - **During conversion, a new account (with the same login credentials) will be created for you. You can then use that account to [set up](./setup.md) the organization.**
Before converting your personal account into an organization, make sure it has a **username**.
diff --git a/sources/platform/console/billing.md b/sources/platform/console/billing.md
index 223f937cff..a09226981c 100644
--- a/sources/platform/console/billing.md
+++ b/sources/platform/console/billing.md
@@ -27,7 +27,9 @@ The **Historical usage** tab provides a detailed view of your monthly platform u
The tab features an adjustable bar chart. This chart can be customized to display statistics either on a monthly or daily basis. Additionally, you can view these statistics as absolute or cumulative numbers, providing flexibility in how you analyze your usage data.
:::info Monthly usage data
+
Since billing cycles can shift, the data in the **Historical usage** tab is shown for calendar months.
+
:::

diff --git a/sources/platform/console/index.md b/sources/platform/console/index.md
index 78293381f6..4103aa78f4 100644
--- a/sources/platform/console/index.md
+++ b/sources/platform/console/index.md
@@ -24,7 +24,9 @@ This is the most common way of creating an account. You just need to provide you
After you click the **Sign up** button, we will send you a verification email. The email contains a link that you need to click on or copy to your browser to proceed to automated email verification. After we verify your email, you will proceed to Apify Console.
:::info CAPTCHA
-We are using Google reCaptcha to prevent spam accounts. Usually, you will not see it, but if Google evaluates your browser as suspicious, they will ask you to solve a reCaptcha before we create your account and send you the verification email.
+
+We are using Google reCAPTCHA to prevent spam accounts. Usually, you will not see it, but if Google evaluates your browser as suspicious, they will ask you to solve a reCAPTCHA before we create your account and send you the verification email.
+
:::
If you did not receive the email, you can visit the [sign-in page](https://console.apify.com/sign-in). There, you will either proceed to our verification page right away, or you can sign in and will be redirected afterward. On the verification page, you can click on the **Resend verification email** button to send the email again.
@@ -81,47 +83,48 @@ The Apify Console homepage provides an overview of your account setup. The heade
- **Suggested Actors for You**: Based on your and other users' recent activities, this section recommends Actors that might interest you.
- **Actor Runs**: This section is divided into two tabs:
- - **Recent**: View your latest Actor runs.
- - **Scheduled**: Check your upcoming scheduled runs and tasks.
+ - **Recent**: View your latest Actor runs.
+ - **Scheduled**: Check your upcoming scheduled runs and tasks.
Use the side menu to navigate other parts of Apify Console easily.
-#### Keyboard shortcuts
+
+### Keyboard shortcuts
You can also navigate Apify Console via keyboard shortcuts.
Keyboard Shortcuts
-|Shortcut| Tab |
-|:---|:----|
-|Show shortcuts | Shift? |
-|Home| GH |
-|Store| GO |
-|Actors| GA |
-|Development| GD |
-|Saved tasks| GT |
-|Runs| GR |
-|Integrations | GI |
-|Schedules| GU |
-|Storage| GE |
-|Proxy| GP |
-|Settings| GS |
-|Billing| GB |
+| Shortcut | Tab |
+| :------------- | :----- |
+| Show shortcuts | Shift + ? |
+| Home | GH |
+| Store | GO |
+| Actors | GA |
+| Development | GD |
+| Saved tasks | GT |
+| Runs | GR |
+| Integrations | GI |
+| Schedules | GU |
+| Storage | GE |
+| Proxy | GP |
+| Settings | GS |
+| Billing | GB |
-| Tab name | Description |
-|:---|:---|
-| [Apify Store](/platform/console/store)| Search for Actors that suit your web-scraping needs. |
-| [Actors](/platform/actors)| View recent & bookmarked Actors. |
-| [Runs](/platform/actors/running/runs-and-builds)| View your recent runs. |
-| [Saved tasks](/platform/actors/running/tasks)| View your saved tasks. |
-| [Schedules](/platform/schedules)| Schedule Actor runs & tasks to run at specified time. |
-| [Integrations](/platform/integrations)| View your integrations. |
-| [Development](/platform/actors/development)| • My Actors - See Actors developed by you. • Insights - see analytics for your Actors. • Messaging - check on issues reported in your Actors or send emails to users of your Actors. |
-| [Proxy](/platform/proxy)| View your proxy usage & credentials |
-| [Storage](/platform/storage)| View stored results of your runs in various data formats. |
-| [Billing](/platform/console/billing)| Billing information, statistics and invoices. |
-| [Settings](/platform/console/settings)| Settings of your account. |
+| Tab name | Description |
+| :----------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [Apify Store](/platform/console/store) | Search for Actors that suit your web-scraping needs. |
+| [Actors](/platform/actors) | View recent & bookmarked Actors. |
+| [Runs](/platform/actors/running/runs-and-builds) | View your recent runs. |
+| [Saved tasks](/platform/actors/running/tasks) | View your saved tasks. |
+| [Schedules](/platform/schedules)                 | Schedule Actor runs & tasks to run at a specified time. |
+| [Integrations](/platform/integrations) | View your integrations. |
+| [Development](/platform/actors/development) | • My Actors - See Actors developed by you. • Insights - see analytics for your Actors. • Messaging - check on issues reported in your Actors or send emails to users of your Actors. |
+| [Proxy](/platform/proxy)                         | View your proxy usage & credentials. |
+| [Storage](/platform/storage) | View stored results of your runs in various data formats. |
+| [Billing](/platform/console/billing) | Billing information, statistics and invoices. |
+| [Settings](/platform/console/settings) | Settings of your account. |
diff --git a/sources/platform/console/settings.md b/sources/platform/console/settings.md
index ff829d36cb..7b9e7c7bdf 100644
--- a/sources/platform/console/settings.md
+++ b/sources/platform/console/settings.md
@@ -14,13 +14,13 @@ slug: /console/settings
By clicking the **Settings** tab on the side menu, you will be presented with an Account page where you can view & edit various settings regarding your account, such as:
-* account email
-* username
-* profile information
-* theme
-* login information
-* session information
-* account delete
+- account email
+- username
+- profile information
+- theme
+- login information
+- session information
+- account deletion
:::info Verify your identity
@@ -28,7 +28,6 @@ The **Login & Privacy** tab (**Security & Privacy** for organization accounts) c
:::
-
### Session Information
In the **Session Information** section, you can adjust the session configuration. You can modify the default session lifespan of 90 days, this customization helps ensure compliance with organization security policies.
diff --git a/sources/platform/console/store.md b/sources/platform/console/store.md
index fd38e87c49..07707ab25c 100644
--- a/sources/platform/console/store.md
+++ b/sources/platform/console/store.md
@@ -17,11 +17,10 @@ Use the search box at the top of the page to find Actors by service names, such
Alternatively, you can explore Actors grouped under predefined categories below the search box.
You can also organize the results from the store by different criteria, including:
-* Category
-* Pricing model
-* Developers
-* Relevance
-
+- Category
+- Pricing model
+- Developers
+- Relevance
Once you select an Actor from the store, you'll be directed to its specific page. Here, you can configure the settings for your future Actor run, save these configurations for later use, or run the Actor immediately.
diff --git a/sources/platform/console/two-factor-authentication.md b/sources/platform/console/two-factor-authentication.md
index b92c28887a..091afab460 100644
--- a/sources/platform/console/two-factor-authentication.md
+++ b/sources/platform/console/two-factor-authentication.md
@@ -45,7 +45,9 @@ In this step, you will see 16 recovery codes. If you ever lose access to your au
Under the recovery codes, you will find two fields for your recovery information. These two fields are what the support team will ask you to provide in case you lose access to your authenticator app and also to your recovery codes. We will never use the phone number for anything other than to verify your identity and help you regain access to your account, only as a last resort. Ideally, the personal information you provide will be enough to verify your identity. Always provide both the kind of personal information you provide and the actual information.
:::info Personal information
+
What kind of personal information you provide is completely up to you. It does not even have to be personal, as long as it's secure and easy to remember. For example, it can be the name of your pet, the name of your favorite book, some secret code, or anything else. Keep in mind who has access to that information. While you can use the name of your pet, if you share information about your pet on public social media, it's not a good choice because anyone on the internet can access it. The same goes for any other information you provide.
+
:::
You will not be able to enable the two-factor authentication until you click on the **Download** / **Copy** buttons or copy the codes manually. After you do that, the **Continue** button will light up, and you can click on it to enable the two-factor authentication. The authentication process will then enable the two-factor authentication for your account and show a confirmation.
@@ -56,7 +58,6 @@ When you close the setup process, you should see that your two-factor authentica

-
## Verification after sign-in
After you enable two-factor authentication, the next time you attempt to sign in, you'll need to enter a code before you can get into the Apify Console. To do that, open your authenticator app and enter the code for your Apify account into the **Code** field. After you enter the code, click on the **Verify** button, and if the provided code is correct, you will proceed to Apify Console.
@@ -70,9 +71,10 @@ In case you lose access to your authenticator app, you can use the recovery code
If the provided recovery code is correct, you will proceed to Apify Console, the same as if you provided the code from the authenticator app. After gaining access to Apify Console, we recommend going to the [Login & Privacy](https://console.apify.com/settings/security) section of your account settings, disabling the two-factor authentication there, and then enabling it again with the new authenticator app.
:::info Removal of recovery codes
+
When you successfully use a recovery code, we remove the code from the original list as it's no longer possible to use it again. If you use all of your recovery codes, you will not be able to sign in to your account with them anymore, and you will need to either use your authenticator app or contact our support to help you regain access to your account.
-:::
+:::

@@ -91,7 +93,9 @@ If you lose access to your authenticator app and do not have any recovery codes
For our support team to help you recover your account, you will need to provide them with the personal information you have configured during the two-factor authentication setup. If you provide the correct information, the support team will help you regain access to your account.
:::caution
+
The support team will not give you any clues about the information you provided; they will only verify if it is correct.
+
:::
You can always check what information you provided by going to the [Login & Privacy](https://console.apify.com/settings/security) section of your account settings, to the **Two-factor authentication** section, and clicking on the **Recovery settings** button, then you should see a view like this:
diff --git a/sources/platform/integrations/actors/index.md b/sources/platform/integrations/actors/index.md
index 18435148fd..9643371b4f 100644
--- a/sources/platform/integrations/actors/index.md
+++ b/sources/platform/integrations/actors/index.md
@@ -25,12 +25,12 @@ To integrate one Actor with another:
1. Navigate to the **Integrations** tab in the Actor's detail page.
2. Select `Apify (Connect Actor or Task)`.
-
+ 
3. Find the Actor or task you want to integrate with and click `Connect`.
This leads you to a setup screen, where you can provide:
-- **Triggers**: Events that will trigger the integrated Actor. These are the same as webhook [event types](/platform/integrations/webhooks/events) (*run succeeded*, *build failed*, etc.)
+- **Triggers**: Events that will trigger the integrated Actor. These are the same as webhook [event types](/platform/integrations/webhooks/events) (_run succeeded_, _build failed_, etc.)

diff --git a/sources/platform/integrations/actors/integrating_actors_via_api.md b/sources/platform/integrations/actors/integrating_actors_via_api.md
index a2cca419c1..0711fbe4af 100644
--- a/sources/platform/integrations/actors/integrating_actors_via_api.md
+++ b/sources/platform/integrations/actors/integrating_actors_via_api.md
@@ -14,10 +14,10 @@ import TabItem from '@theme/TabItem';
You can integrate Actors via API using the [Create webhook](/api/v2/webhooks-post) endpoint. It's the same as any other webhook, but to make sure you see it in Apify Console, you need to make sure of a few things.
-* The `requestUrl` field needs to point to the **Run Actor** or **Run task** endpoints and needs to use their IDs as identifiers (i.e. not their technical names).
-* The `payloadTemplate` field should be valid JSON - i.e. it should only use variables enclosed in strings. You will also need to make sure that it contains a `payload` field.
-* The `shouldInterpolateStrings` field needs to be set to `true`, otherwise the variables won't work.
-* Add `isApifyIntegration` field with the value `true`. This is a helper that turns on the Actor integration UI, if the above conditions are met.
+- The `requestUrl` field needs to point to the **Run Actor** or **Run task** endpoints and needs to use their IDs as identifiers (i.e. not their technical names).
+- The `payloadTemplate` field should be valid JSON - i.e. it should only use variables enclosed in strings. You will also need to make sure that it contains a `payload` field.
+- The `shouldInterpolateStrings` field needs to be set to `true`; otherwise, the variables won't work.
+- Add the `isApifyIntegration` field with the value `true`. This is a helper that turns on the Actor integration UI if the above conditions are met.
Not meeting the conditions does not mean that the webhook won't work; it will just be displayed as a regular HTTP webhook in Apify Console.
@@ -25,14 +25,14 @@ The webhook should look something like this:
```json5
{
- "requestUrl": "https://api.apify.com/v2/acts//runs",
- "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
- "condition": {
- "actorId": "",
+ requestUrl: 'https://api.apify.com/v2/acts//runs',
+ eventTypes: ['ACTOR.RUN.SUCCEEDED'],
+ condition: {
+ actorId: '',
},
- "shouldInterpolateStrings": true,
- "isApifyIntegration": true,
- "payloadTemplate": "{\"field\":\"value\",\"payload\":{\"resource\":\"{{resource}}\"}}",
+ shouldInterpolateStrings: true,
+ isApifyIntegration: true,
+ payloadTemplate: '{"field":"value","payload":{"resource":"{{resource}}"}}',
}
```
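The same webhook can also be created from code. A minimal sketch using the JavaScript API client, assuming its `webhooks().create()` call accepts the same fields as the endpoint (`TARGET_ACTOR_ID` and `WATCHED_ACTOR_ID` are placeholders):

```js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

await client.webhooks().create({
  // Run the target Actor (by ID, not technical name) when the watched Actor's run succeeds.
  requestUrl: 'https://api.apify.com/v2/acts/TARGET_ACTOR_ID/runs',
  eventTypes: ['ACTOR.RUN.SUCCEEDED'],
  condition: { actorId: 'WATCHED_ACTOR_ID' },
  shouldInterpolateStrings: true,
  isApifyIntegration: true,
  payloadTemplate: '{"field":"value","payload":{"resource":"{{resource}}"}}',
});
```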
diff --git a/sources/platform/integrations/actors/integration_ready_actors.md b/sources/platform/integrations/actors/integration_ready_actors.md
index b4f0e66fa2..bde2074ad9 100644
--- a/sources/platform/integrations/actors/integration_ready_actors.md
+++ b/sources/platform/integrations/actors/integration_ready_actors.md
@@ -62,7 +62,6 @@ However, if the Actor is **only** supposed to be used as integration, we can use
- `connectionString: string` - Credentials for the database connection
- `tableName: string` - Name of table / collection
-
In this case, users only need to provide the "static" part of the input:
```json
diff --git a/sources/platform/integrations/ai/aws_bedrock.md b/sources/platform/integrations/ai/aws_bedrock.md
index 0b22a86f22..eac8149085 100644
--- a/sources/platform/integrations/ai/aws_bedrock.md
+++ b/sources/platform/integrations/ai/aws_bedrock.md
@@ -100,11 +100,11 @@ The final step is to update the Lambda function to implement the OpenAPI schema
1. Open the Lambda function you created and copy-paste the [Python lambda function](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/aws-lambda-call-rag-web-browser.py).
1. Replace `APIFY_API_TOKEN` in the code with your Apify API token. Alternatively, store the token as an environment variable:
- - Go to the Configuration tab.
- - Select Environment Variables.
- - Add a new variable by specifying a key and value.
+ - Go to the Configuration tab.
+ - Select Environment Variables.
+ - Add a new variable by specifying a key and value.
1. Configure the Lambda function:
- - Set the memory allocation to 128 MB and timeout duration to 60 seconds.
+ - Set the memory allocation to 128 MB and timeout duration to 60 seconds.
1. Save the Lambda function and deploy it.
#### Step 4: Test the agent
diff --git a/sources/platform/integrations/ai/flowise.md b/sources/platform/integrations/ai/flowise.md
index fbffcfa60f..b2880251a1 100644
--- a/sources/platform/integrations/ai/flowise.md
+++ b/sources/platform/integrations/ai/flowise.md
@@ -66,5 +66,5 @@ For more information visit the Flowise [documentation](https://flowiseai.com/).
## Resources
-* [Flowise](https://flowiseai.com/)
-* [Flowise documentation](https://github.com/FlowiseAI/Flowise#quick-start)
+- [Flowise](https://flowiseai.com/)
+- [Flowise documentation](https://github.com/FlowiseAI/Flowise#quick-start)
diff --git a/sources/platform/integrations/ai/haystack.md b/sources/platform/integrations/ai/haystack.md
index 2c4993b630..6cc9c7c5d9 100644
--- a/sources/platform/integrations/ai/haystack.md
+++ b/sources/platform/integrations/ai/haystack.md
@@ -179,7 +179,6 @@ for doc in results["retriever"]["documents"]:
To run it, you can use the following command: `python apify_integration.py`
-
## Resources
- [Apify-haystack integration documentation](https://haystack.deepset.ai/integrations/apify)
diff --git a/sources/platform/integrations/ai/langchain.md b/sources/platform/integrations/ai/langchain.md
index 8d78ea758c..34adcb224a 100644
--- a/sources/platform/integrations/ai/langchain.md
+++ b/sources/platform/integrations/ai/langchain.md
@@ -16,7 +16,7 @@ In this example, we'll use the [Website Content Crawler](https://apify.com/apify
Then we feed the documents into a vector index and answer questions from it.
This example demonstrates how to integrate Apify with LangChain using the Python language.
-If you prefer to use JavaScript, you can follow the [JavaScript LangChain documentation](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/apify_dataset/).
+If you prefer to use JavaScript, you can follow the [JavaScript LangChain documentation](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/apify_dataset/).
Before we start with the integration, we need to install all dependencies:
diff --git a/sources/platform/integrations/ai/langflow.md b/sources/platform/integrations/ai/langflow.md
index e50fa72b7f..4711b070d6 100644
--- a/sources/platform/integrations/ai/langflow.md
+++ b/sources/platform/integrations/ai/langflow.md
@@ -70,12 +70,12 @@ To call Apify Actors in Langflow, you need to add the **Apify Actors** component
From the bundle menu, add **Apify Actors** component:

-Next, configure the Apify Actors components. First, input your API token (learn how to get it at [Integrations](https://docs.apify.com/platform/integrations/api)).
+Next, configure the Apify Actors components. First, input your API token (learn how to get it at [Integrations](https://docs.apify.com/platform/integrations/api)).
Then, set the Actor ID of the component to `apify/rag-web-browser` to use the [RAG Web Browser](https://apify.com/apify/rag-web-browser).
Set the **Run input** field to pass arguments to the Actor run, allowing it to search Google with the query `"what is monero?"` (full Actor input schema can be found in the [RAG Web Browser input schema](https://apify.com/apify/rag-web-browser/input-schema)):
```json
-{"query": "what is monero?", "maxResults": 3}
+{ "query": "what is monero?", "maxResults": 3 }
```
Click **Run**.
@@ -88,6 +88,7 @@ The output should look similar to this:

To filter only the `metadata` and `markdown` fields, set **Output fields** to `metadata,markdown`. Additionally, enable **Flatten output** by setting it to `true`. This will output only the metadata and text content from the search results.
+
> Flattening is necessary when you need to access nested dictionary fields in the output data object; they cannot be accessed directly otherwise in the Data object.

diff --git a/sources/platform/integrations/ai/langgraph.md b/sources/platform/integrations/ai/langgraph.md
index 27d76d9e41..6b03009edf 100644
--- a/sources/platform/integrations/ai/langgraph.md
+++ b/sources/platform/integrations/ai/langgraph.md
@@ -30,7 +30,7 @@ This guide will demonstrate how to use Apify Actors with LangGraph by building a
- **OpenAI API key**: In order to work with agents in LangGraph, you need an OpenAI API key. If you don't have one, you can get it from the [OpenAI platform](https://platform.openai.com/account/api-keys).
-- **Python packages**: You need to install the following Python packages:
+- **Python packages**: You need to install the following Python packages:
```bash
pip install langgraph langchain-apify langchain-openai
@@ -122,7 +122,6 @@ The OpenAI TikTok profile is titled "OpenAI (@openai) Official." Here are some k
```
-
If you want to test the whole example, you can simply create a new file, `langgraph_integration.py`, and copy the whole code into it.
```python
diff --git a/sources/platform/integrations/ai/lindy.md b/sources/platform/integrations/ai/lindy.md
index 6275507d56..661cb130ba 100644
--- a/sources/platform/integrations/ai/lindy.md
+++ b/sources/platform/integrations/ai/lindy.md
@@ -30,13 +30,11 @@ This section demonstrates how to integrate Apify's data extraction capabilities

-
1. Choose a trigger that will initiate your automation. For this demonstration, we will select **Chat with Lindy/Message received**. This allows you to trigger the Apify Actor simply by sending a message to Lindy.


-
1. After setting the trigger, select **Perform an Action**.

@@ -65,13 +63,13 @@ Lindy offers different triggers (e.g., _email received_, _Slack message received
After the Apify Actor run is initiated, you can define what happens next, depending on your needs:
- **When Actor Run Starts:**
- - You might want to send a notification.
- - Log the start time.
- - Run a pre-processing step.
+ - You might want to send a notification.
+ - Log the start time.
+ - Run a pre-processing step.
- **After Results Are Available:** Once the Apify Actor completes and its results are ready, you can:
- - Retrieve the Actor's output data from its dataset.
- - Pass the extracted data to Lindy's AI for summarization, analysis, content generation, or other AI-driven tasks.
- - Route the data to other services (e.g., Google Sheets, databases, email notifications) using Lindy's action modules.
+ - Retrieve the Actor's output data from its dataset.
+ - Pass the extracted data to Lindy's AI for summarization, analysis, content generation, or other AI-driven tasks.
+ - Route the data to other services (e.g., Google Sheets, databases, email notifications) using Lindy's action modules.
## Available Actions in Lindy for Apify
diff --git a/sources/platform/integrations/ai/llama.md b/sources/platform/integrations/ai/llama.md
index c83d86b656..4128e052ac 100644
--- a/sources/platform/integrations/ai/llama.md
+++ b/sources/platform/integrations/ai/llama.md
@@ -32,7 +32,6 @@ To use the Apify Actor, import `ApifyActor` and `Document`, and set your [Apify
The following example uses the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl an entire website, which will extract text content from the web pages.
The extracted text is formatted as a llama_index `Document` and can be fed to a vector store or language model like GPT.
-
```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor
@@ -75,5 +74,5 @@ documents = reader.load_data(
## Resources
-* [Apify loaders](https://llamahub.ai/l/readers/llama-index-readers-apify)
-* [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/)
+- [Apify loaders](https://llamahub.ai/l/readers/llama-index-readers-apify)
+- [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/)
diff --git a/sources/platform/integrations/ai/mastra.md b/sources/platform/integrations/ai/mastra.md
index 1de306904d..d050199058 100644
--- a/sources/platform/integrations/ai/mastra.md
+++ b/sources/platform/integrations/ai/mastra.md
@@ -35,9 +35,9 @@ This guide demonstrates how to integrate Apify Actors with Mastra by building an
- _Node.js_: Ensure you have Node.js installed.
- _Packages_: Install the following packages:
- ```bash
- npm install @mastra/core @mastra/mcp @ai-sdk/openai
- ```
+ ```bash
+ npm install @mastra/core @mastra/mcp @ai-sdk/openai
+ ```
### Building the TikTok profile search and analysis agent
@@ -54,8 +54,8 @@ import { openai } from '@ai-sdk/openai';
Next, set the environment variables for the Apify API token and OpenAI API key:
```typescript
-process.env.APIFY_TOKEN = "your-apify-token";
-process.env.OPENAI_API_KEY = "your-openai-api-key";
+process.env.APIFY_TOKEN = 'your-apify-token';
+process.env.OPENAI_API_KEY = 'your-openai-api-key';
// For Anthropic use
// process.env.ANTHROPIC_API_KEY = "your-anthropic-api-key";
```
@@ -68,7 +68,7 @@ const mcpClient = new MastraMCPClient({
server: {
url: new URL('https://mcp.apify.com/sse'),
requestInit: {
- headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` }
+ headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
},
// The EventSource package augments EventSourceInit with a "fetch" parameter.
// You can use this to set additional headers on the outgoing request.
@@ -78,8 +78,8 @@ const mcpClient = new MastraMCPClient({
const headers = new Headers(init?.headers || {});
headers.set('authorization', `Bearer ${process.env.APIFY_TOKEN}`);
return fetch(input, { ...init, headers });
- }
- }
+ },
+ },
},
timeout: 300_000, // 5 minutes tool call timeout
});
@@ -99,19 +99,21 @@ Instantiate the agent with the OpenAI model:
```typescript
const agent = new Agent({
name: 'Social Media Agent',
- instructions: 'You’re a social media data extractor. Find TikTok URLs and analyze profiles with precision.',
+ instructions:
+ 'You’re a social media data extractor. Find TikTok URLs and analyze profiles with precision.',
// You can swap to any other AI-SDK LLM provider
- model: openai('gpt-4o-mini')
+ model: openai('gpt-4o-mini'),
});
```
Generate a response using the agent and the Apify tools:
```typescript
-const prompt = 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.';
+const prompt =
+ 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.';
console.log(`Generating response for prompt: ${prompt}`);
const response = await agent.generate(prompt, {
- toolsets: { apify: tools }
+ toolsets: { apify: tools },
});
```
@@ -164,8 +166,8 @@ import { openai } from '@ai-sdk/openai';
// For Anthropic use
// import { anthropic } from '@ai-sdk/anthropic';
-process.env.APIFY_TOKEN = "your-apify-token";
-process.env.OPENAI_API_KEY = "your-openai-api-key";
+process.env.APIFY_TOKEN = 'your-apify-token';
+process.env.OPENAI_API_KEY = 'your-openai-api-key';
// For Anthropic use
// process.env.ANTHROPIC_API_KEY = "your-anthropic-api-key";
@@ -174,18 +176,18 @@ const mcpClient = new MastraMCPClient({
server: {
url: new URL('https://mcp.apify.com/sse'),
requestInit: {
- headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` }
+ headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
},
// The EventSource package augments EventSourceInit with a "fetch" parameter.
// You can use this to set additional headers on the outgoing request.
// Based on this example: https://github.com/modelcontextprotocol/typescript-sdk/issues/118
eventSourceInit: {
async fetch(input: Request | URL | string, init?: RequestInit) {
- const headers = new Headers(init?.headers || {});
- headers.set('authorization', `Bearer ${process.env.APIFY_TOKEN}`);
- return fetch(input, { ...init, headers });
- }
- }
+ const headers = new Headers(init?.headers || {});
+ headers.set('authorization', `Bearer ${process.env.APIFY_TOKEN}`);
+ return fetch(input, { ...init, headers });
+ },
+ },
},
timeout: 300_000, // 5 minutes tool call timeout
});
@@ -197,15 +199,17 @@ const tools = await mcpClient.tools();
const agent = new Agent({
name: 'Social Media Agent',
- instructions: 'You’re a social media data extractor. Find TikTok URLs and analyze profiles with precision.',
+ instructions:
+ 'You’re a social media data extractor. Find TikTok URLs and analyze profiles with precision.',
// You can swap to any other AI-SDK LLM provider
- model: openai('gpt-4o-mini')
+ model: openai('gpt-4o-mini'),
});
-const prompt = 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.';
+const prompt =
+ 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.';
console.log(`Generating response for prompt: ${prompt}`);
const response = await agent.generate(prompt, {
- toolsets: { apify: tools }
+ toolsets: { apify: tools },
});
console.log(response.text);
diff --git a/sources/platform/integrations/ai/mcp.md b/sources/platform/integrations/ai/mcp.md
index c74bf5b356..cb3e62ea45 100644
--- a/sources/platform/integrations/ai/mcp.md
+++ b/sources/platform/integrations/ai/mcp.md
@@ -43,11 +43,11 @@ authentication without exposing your API token.
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com"
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com"
+ }
}
- }
}
```
@@ -58,14 +58,14 @@ You can also use your Apify token directly, instead of OAuth, by setting the `Au
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com",
- "headers": {
- "Authorization": "Bearer "
- }
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com",
+ "headers": {
+ "Authorization": "Bearer "
+ }
+ }
}
- }
}
```
@@ -103,11 +103,11 @@ To add Apify MCP server to Cursor manually:
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com"
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com"
+ }
}
- }
}
```
@@ -120,14 +120,14 @@ To add Apify MCP server to Cursor manually:
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com",
- "headers": {
- "Authorization": "Bearer "
- }
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com",
+ "headers": {
+ "Authorization": "Bearer "
+ }
+ }
}
- }
}
```
@@ -149,7 +149,7 @@ VS Code supports MCP through GitHub Copilot's agent mode (requires Copilot subsc
1. Ensure you have GitHub Copilot installed
1. Open Command Palette (CMD/CTRL + Shift + P) and run _MCP: Open User Configuration_ command.
- - This will open `mcp.json` file in your user profile. If the file does not exist, VS Code creates it for you.
+ - This will open the `mcp.json` file in your user profile. If the file does not exist, VS Code creates it for you.
1. Add the following to the configuration file:
@@ -157,11 +157,11 @@ VS Code supports MCP through GitHub Copilot's agent mode (requires Copilot subsc
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com"
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com"
+ }
}
- }
}
```
@@ -174,14 +174,14 @@ VS Code supports MCP through GitHub Copilot's agent mode (requires Copilot subsc
```json
{
- "mcpServers": {
- "apify": {
- "url": "https://mcp.apify.com",
- "headers": {
- "Authorization": "Bearer "
- }
+ "mcpServers": {
+ "apify": {
+ "url": "https://mcp.apify.com",
+ "headers": {
+ "Authorization": "Bearer "
+ }
+ }
}
- }
}
```
@@ -207,15 +207,15 @@ To manually configure Apify's MCP server for Claude Desktop:
```json
{
- "mcpServers": {
- "actors-mcp-server": {
- "command": "npx",
- "args": ["-y", "@apify/actors-mcp-server"],
- "env": {
- "APIFY_TOKEN": ""
- }
+ "mcpServers": {
+ "actors-mcp-server": {
+ "command": "npx",
+ "args": ["-y", "@apify/actors-mcp-server"],
+ "env": {
+ "APIFY_TOKEN": ""
+ }
+ }
}
- }
}
```
@@ -232,15 +232,15 @@ Add this to your configuration file:
```json
{
- "mcpServers": {
- "actors-mcp-server": {
- "command": "npx",
- "args": ["-y", "@apify/actors-mcp-server"],
- "env": {
- "APIFY_TOKEN": "YOUR_APIFY_TOKEN"
- }
+ "mcpServers": {
+ "actors-mcp-server": {
+ "command": "npx",
+ "args": ["-y", "@apify/actors-mcp-server"],
+ "env": {
+ "APIFY_TOKEN": "YOUR_APIFY_TOKEN"
+ }
+ }
}
- }
}
```
@@ -269,28 +269,27 @@ Use the UI configurator `https://mcp.apify.com/` to select your tools visually,
### Available tools
-| Tool name | Category | Enabled by default | Description |
-| :--- | :--- | :--- | :--- |
-| `search-actors` | actors | ✅ | Search for Actors in Apify Store |
-| `fetch-actor-details` | actors | ✅ | Retrieve detailed information about a specific Actor |
-| `call-actor`* | actors | ❔ | Call an Actor and get its run results |
-| [`apify/rag-web-browser`](https://apify.com/apify/rag-web-browser) | Actor | ✅ | Browse and extract web data |
-| `search-apify-docs` | docs | ✅ | Search the Apify documentation for relevant pages |
-| `fetch-apify-docs` | docs | ✅ | Fetch the full content of an Apify documentation page by its URL |
-| `get-actor-run` | runs | | Get detailed information about a specific Actor run |
-| `get-actor-run-list` | runs | | Get a list of an Actor's runs, filterable by status |
-| `get-actor-log` | runs | | Retrieve the logs for a specific Actor run |
-| `get-dataset` | storage | | Get metadata about a specific dataset |
-| `get-dataset-items` | storage | | Retrieve items from a dataset with support for filtering and pagination |
-| `get-dataset-schema` | storage | | Generate a JSON schema from dataset items |
-| `get-key-value-store` | storage | | Get metadata about a specific key-value store |
-| `get-key-value-store-keys`| storage | | List the keys within a specific key-value store |
-| `get-key-value-store-record`| storage | | Get the value associated with a specific key in a key-value store |
-| `get-dataset-list` | storage | | List all available datasets for the user |
-| `get-key-value-store-list`| storage | | List all available key-value stores for the user |
-| `add-actor`* | experimental | ❔ | Add an Actor as a new tool for the user to call |
-| `get-actor-output`* | - | ✅ | Retrieve the output from an Actor call which is not included in the output preview of the Actor tool. |
-
+| Tool name | Category | Enabled by default | Description |
+| :----------------------------------------------------------------- | :----------- | :----------------- | :---------------------------------------------------------------------------------------------------- |
+| `search-actors` | actors | ✅ | Search for Actors in Apify Store |
+| `fetch-actor-details` | actors | ✅ | Retrieve detailed information about a specific Actor |
+| `call-actor`\* | actors | ❔ | Call an Actor and get its run results |
+| [`apify/rag-web-browser`](https://apify.com/apify/rag-web-browser) | Actor | ✅ | Browse and extract web data |
+| `search-apify-docs` | docs | ✅ | Search the Apify documentation for relevant pages |
+| `fetch-apify-docs` | docs | ✅ | Fetch the full content of an Apify documentation page by its URL |
+| `get-actor-run` | runs | | Get detailed information about a specific Actor run |
+| `get-actor-run-list` | runs | | Get a list of an Actor's runs, filterable by status |
+| `get-actor-log` | runs | | Retrieve the logs for a specific Actor run |
+| `get-dataset` | storage | | Get metadata about a specific dataset |
+| `get-dataset-items` | storage | | Retrieve items from a dataset with support for filtering and pagination |
+| `get-dataset-schema` | storage | | Generate a JSON schema from dataset items |
+| `get-key-value-store` | storage | | Get metadata about a specific key-value store |
+| `get-key-value-store-keys` | storage | | List the keys within a specific key-value store |
+| `get-key-value-store-record` | storage | | Get the value associated with a specific key in a key-value store |
+| `get-dataset-list` | storage | | List all available datasets for the user |
+| `get-key-value-store-list` | storage | | List all available key-value stores for the user |
+| `add-actor`\* | experimental | ❔ | Add an Actor as a new tool for the user to call |
+| `get-actor-output`\* | - | ✅ | Retrieve the output from an Actor call which is not included in the output preview of the Actor tool. |
:::note Retrieving full output
@@ -306,7 +305,6 @@ It can search Apify Store for relevant Actors using the `search-actors` tool, in
This dynamic discovery means your AI can adapt to new tasks without manual configuration.
Each discovered Actor becomes immediately available for future use in the conversation.
-
:::note Dynamic tool discovery
When you use the `actors` tool category, clients that support dynamic tool discovery (such as Claude.ai web and VS Code) will automatically receive the `add-actor` tool instead of `call-actor` for enhanced Actor discovery capabilities.
@@ -314,7 +312,6 @@ For a detailed overview of client support for dynamic discovery, see the [MCP cl
:::
-
## Advanced usage
### Production best practices
@@ -332,6 +329,7 @@ The Apify MCP server allows up to _30_ requests per second per user. This limit
documentation queries. If you exceed this limit, you'll receive a `429` response and should implement appropriate retry logic.
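One way to handle this is a simple exponential backoff around the request. A generic sketch, not tied to any particular MCP client:

```js
// Retry a request that may be rejected with a 429 rate-limit response.
async function fetchWithRetry(url, options = {}, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, options);
    if (response.status !== 429) return response;
    // Exponential backoff: 1 s, 2 s, 4 s, ...
    const delayMs = 1000 * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Still rate limited after ${maxRetries} attempts`);
}
```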
+
## Troubleshooting
##### Authentication errors
@@ -349,6 +347,7 @@ documentation queries. If you exceed this limit, you'll receive a `429` response
- _No response or long delays_: Actor runs can take time to complete depending on their task. If you're experiencing long delays, check the Actor's logs in Apify Console. The logs will provide insight into the Actor's status and show if it's processing a long operation or has encountered an error.
+
## Support and resources
The Apify MCP Server is an open-source project. Report bugs, suggest features, or ask questions in the [GitHub repository](https://github.com/apify/apify-mcp-server/issues).
diff --git a/sources/platform/integrations/ai/milvus.md b/sources/platform/integrations/ai/milvus.md
index af19198323..af230c6bcb 100644
--- a/sources/platform/integrations/ai/milvus.md
+++ b/sources/platform/integrations/ai/milvus.md
@@ -37,7 +37,6 @@ It will be automatically created when data is uploaded to the database.
Once the cluster is ready, and you have the `URI` and `Token`, you can set up the integration with Apify.
-
### Integration Methods
You can integrate Apify with Milvus using either the Apify Console or the Apify Python SDK.
@@ -86,14 +85,12 @@ Another way to interact with Milvus is through the [Apify Python SDK](https://do
1. Call the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl the Milvus documentation and Zilliz website and extract text content from the web pages:
-
```python
actor_call = client.actor("apify/website-content-crawler").call(
run_input={"maxCrawlPages": 10, "startUrls": [{"url": "https://milvus.io/"}, {"url": "https://zilliz.com/"}]}
)
```
-
1. Call Apify's Milvus integration and store all data in the Milvus Vector Database:
```python
diff --git a/sources/platform/integrations/ai/pinecone.md b/sources/platform/integrations/ai/pinecone.md
index f2325a55b8..2998ee4622 100644
--- a/sources/platform/integrations/ai/pinecone.md
+++ b/sources/platform/integrations/ai/pinecone.md
@@ -31,7 +31,7 @@ Before you begin, ensure that you have the following:
1. Specify the following details: index name, vector dimension, vector distance metric, deployment type (serverless or pod), and cloud provider.
- 
+ 
Once the index is created and ready, you can proceed with integrating Apify.
@@ -55,7 +55,7 @@ The examples utilize the Website Content Crawler Actor, which deeply crawls webs
1. Select when to trigger this integration (typically when a run succeeds) and fill in all the required fields for the Pinecone integration. You can learn more about the input parameters at the [Pinecone integration input schema](https://apify.com/apify/pinecone-integration/input-schema).
- 
+ 
:::note Pinecone index configuration
diff --git a/sources/platform/integrations/ai/qdrant.md b/sources/platform/integrations/ai/qdrant.md
index de8ad5bbde..97df3814d1 100644
--- a/sources/platform/integrations/ai/qdrant.md
+++ b/sources/platform/integrations/ai/qdrant.md
@@ -35,7 +35,6 @@ Before you begin, ensure that you have the following:
With the cluster ready and its URL and API key in hand, you can proceed with integrating Apify.
-
### Integration Methods
You can integrate Apify with Qdrant using either the Apify Console or the Apify Python SDK.
@@ -56,7 +55,7 @@ The examples utilize the Website Content Crawler Actor, which deeply crawls webs
1. Select when to trigger this integration (typically when a run succeeds) and fill in all the required fields for the Qdrant integration. If you haven't created a collection, it can be created automatically with the specified model. You can learn more about the input parameters at the [Qdrant integration input schema](https://apify.com/apify/qdrant-integration).
- 
+ 
- For a detailed explanation of the input parameters, including dataset settings, incremental updates, and examples, see the [Qdrant integration description](https://apify.com/apify/qdrant-integration).
diff --git a/sources/platform/integrations/ai/vercel-ai-sdk.md b/sources/platform/integrations/ai/vercel-ai-sdk.md
index b73aa55a7b..6e131d12bd 100644
--- a/sources/platform/integrations/ai/vercel-ai-sdk.md
+++ b/sources/platform/integrations/ai/vercel-ai-sdk.md
@@ -24,7 +24,6 @@ For more in-depth details, check out [Vercel AI SDK documentation](https://ai-sd
Apify is a marketplace of ready-to-use web scraping and automation tools, AI agents, and MCP servers that you can equip your own AI with. This guide demonstrates how to use Apify tools with a simple AI agent built with Vercel AI SDK.
-
### Prerequisites
- _Apify API token_: To use Apify Actors in Vercel AI SDK, you need an Apify API token. To obtain your token check [Apify documentation](https://docs.apify.com/platform/integrations/api).
@@ -56,13 +55,13 @@ Make sure to set the `APIFY_TOKEN` environment variable with your Apify API toke
// Connect to the Apify MCP server and get the available tools
const url = new URL('https://mcp.apify.com');
const mcpClient = await createMCPClient({
- transport: new StreamableHTTPClientTransport(url, {
- requestInit: {
- headers: {
- "Authorization": `Bearer ${process.env.APIFY_TOKEN}`
- }
- }
- }),
+ transport: new StreamableHTTPClientTransport(url, {
+ requestInit: {
+ headers: {
+ Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
+ },
+ },
+ }),
});
const tools = await mcpClient.tools();
console.log('Tools available:', Object.keys(tools).join(', '));
@@ -82,8 +81,8 @@ const openrouter = createOpenRouter({
baseURL: 'https://openrouter.apify.actor/api/v1',
apiKey: 'api-key-not-required',
headers: {
- "Authorization": `Bearer ${process.env.APIFY_TOKEN}`
- }
+ Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
+ },
});
```
@@ -98,7 +97,12 @@ const response = await generateText({
messages: [
{
role: 'user',
- content: [{ type: 'text', text: 'Find a pub near the Ferry Building in San Francisco using the Google Maps scraper.' }],
+ content: [
+ {
+ type: 'text',
+ text: 'Find a pub near the Ferry Building in San Francisco using the Google Maps scraper.',
+ },
+ ],
},
],
});
diff --git a/sources/platform/integrations/data-storage/airbyte.md b/sources/platform/integrations/data-storage/airbyte.md
index 8ba62f75e0..ed4d170ec3 100644
--- a/sources/platform/integrations/data-storage/airbyte.md
+++ b/sources/platform/integrations/data-storage/airbyte.md
@@ -15,8 +15,8 @@ One of these connectors is the Apify Dataset connector, which makes it simple to
To use Airbyte's Apify connector you need to:
-* Have an Apify account.
-* Have an Airbyte account.
+- Have an Apify account.
+- Have an Airbyte account.
## Set up Apify connector in Airbyte
diff --git a/sources/platform/integrations/data-storage/airtable/console_integration.md b/sources/platform/integrations/data-storage/airtable/console_integration.md
index b3bac3c459..c4b8249262 100644
--- a/sources/platform/integrations/data-storage/airtable/console_integration.md
+++ b/sources/platform/integrations/data-storage/airtable/console_integration.md
@@ -10,7 +10,7 @@ slug: /integrations/airtable/console
---
-[Airtable](https://www.airtable.com/) is a cloud-based platform for organizing, managing, and collaborating on data. With Apify integration for Airtable, you can automatically upload Actor run results to Airtable after a successful run.
+[Airtable](https://www.airtable.com/) is a cloud-based platform for organizing, managing, and collaborating on data. With Apify integration for Airtable, you can automatically upload Actor run results to Airtable after a successful run.
This integration uses OAuth 2.0, a secure authorization protocol, to connect your Airtable account to Apify and manage data transfers.
@@ -45,19 +45,18 @@ To use the Apify integration for Airtable, ensure you have:

1. Select the upload mode:
- - **CREATE**: New table is created for each run of this integration.
- - **APPEND**: New records are added to the specified table. If the table does not yet exist, new one is created.
- - **OVERWRITE**: All records in the specified table are replaced with new data. If the table does not yet exist, new one is created.
+   - **CREATE**: A new table is created for each run of this integration.
+   - **APPEND**: New records are added to the specified table. If the table does not yet exist, a new one is created.
+   - **OVERWRITE**: All records in the specified table are replaced with new data. If the table does not yet exist, a new one is created.
1. Select a connected Airtable account and choose the base where the Actor run results will be uploaded.
1. Enter a table name or select an existing one.
- To ensure uniqueness when using CREATE mode, use dynamic variables. If a table with the same name already exists in CREATE mode, a random token will be appended.
+ To ensure uniqueness when using CREATE mode, use dynamic variables. If a table with the same name already exists in CREATE mode, a random token will be appended.

1. Save the integration. Once your Actor runs, you'll see its results uploaded to Airtable.

-
diff --git a/sources/platform/integrations/data-storage/airtable/index.md b/sources/platform/integrations/data-storage/airtable/index.md
index fb7e280435..452c206c5c 100644
--- a/sources/platform/integrations/data-storage/airtable/index.md
+++ b/sources/platform/integrations/data-storage/airtable/index.md
@@ -10,7 +10,7 @@ slug: /integrations/airtable
---
-[Airtable](https://www.airtable.com/) is a cloud-based platform for organizing, managing, and collaborating on data. With the Apify integration for Airtable, you can automatically upload Actor run results to Airtable after a successful run.
+[Airtable](https://www.airtable.com/) is a cloud-based platform for organizing, managing, and collaborating on data. With the Apify integration for Airtable, you can automatically upload Actor run results to Airtable after a successful run.
This integration uses OAuth 2.0, a secure authorization protocol, to connect your Airtable account to Apify and manage data transfers.
@@ -39,6 +39,7 @@ Go to [Airtable](https://airtable.com) and open the base you would like to work

+
Search for the Apify extension and install it

@@ -65,13 +66,13 @@ The extension provides the following capabilities:
### Run Actor
1. Select any Actor from **Apify store** or **recently used Actors**
-
+ 
1. Fill in the Actor input form.
-
+ 
1. Run the Actor and wait for results
-
+ 
### Run task
@@ -79,7 +80,6 @@ You can select and run any saved Apify task directly from the extension to reuse

-
### Get dataset items
Retrieve items from any Apify dataset and import them into your Airtable base with a single click.
@@ -107,8 +107,8 @@ A period (`.`) in field labels indicates nested elements within an object.
```json
{
- crawl: {
- depth: 'the field you selected',
+ "crawl": {
+ "depth": "the field you selected"
}
}
```
diff --git a/sources/platform/integrations/data-storage/drive.md b/sources/platform/integrations/data-storage/drive.md
index 44fd83524d..0518995b3c 100644
--- a/sources/platform/integrations/data-storage/drive.md
+++ b/sources/platform/integrations/data-storage/drive.md
@@ -22,14 +22,14 @@ To use the Apify integration for Google Drive, you will need:
1. Head over to **Integrations** tab in your saved task and click on the **Upload file** integration.
- 
+
1. Click on **Connect with Google** button and select the account with which you want to use the integration.
- 
+
1. Set up the integration details. You can choose the **Filename** and **Format** , which can make use of available variables.
-The file will be uploaded to your Google Drive account to `Apify Uploads` folder. By default, the integration is triggered by successful runs only.
+ The file will be uploaded to the `Apify Uploads` folder in your Google Drive account. By default, the integration is triggered by successful runs only.

diff --git a/sources/platform/integrations/data-storage/keboola.md b/sources/platform/integrations/data-storage/keboola.md
index e75ba93576..3b98df8358 100644
--- a/sources/platform/integrations/data-storage/keboola.md
+++ b/sources/platform/integrations/data-storage/keboola.md
@@ -41,7 +41,6 @@ With the new configuration created, you can now configure the data source to ret

-
#### Choose an action
In the next step, you can choose the action you want to perform:
@@ -66,7 +65,7 @@ In the specifications step, you can set up various options for your Actor run:
- **Actor**: Select the Actor you want to run from your Apify account.
- **Input Table**: Choose a table from the Keboola platform to be sent to the Actor as input data.
-- **Output field**: Comma-separated list of fields to be picked from the dataset.
+- **Output field**: Comma-separated list of fields to be picked from the dataset.
- **Memory**: Adjust the memory settings if needed (the default values can be kept).
- **Build**: Adjust if you want to run a specific build of an Actor. Tag or number of the build to run.
- **Actor Input**: Pass any JSON data as input to the Actor.
diff --git a/sources/platform/integrations/integrate_with_apify.md b/sources/platform/integrations/integrate_with_apify.md
index a99f6176e3..4fbbedc0ef 100644
--- a/sources/platform/integrations/integrate_with_apify.md
+++ b/sources/platform/integrations/integrate_with_apify.md
@@ -58,6 +58,7 @@ An alternative way is to let your users manage the connection directly on your s
Apify supports two main authentication methods for secure API access.
_OAuth 2.0_ - Use OAuth 2.0 to allow users to authorize your integration without sharing their credentials.
+
_API token_ - Apify user generates personal API token from Apify account settings page. For more information, see [API Token documentation](https://docs.apify.com/platform/integrations/api#api-token).
@@ -117,9 +118,9 @@ Recommended features:
- _URL_: that you intend to scrape (string)
- _Crawler type_: Dropdown menu, allowing users to choose from the following options:
- - _Headless web browser_ - Useful for websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions.
- - _Stealthy web browser (default)_ - Another headless web browser with anti-blocking measures enabled. Try this if you encounter anti-bot protections while scraping.
- - _Raw HTTP client_ - High-performance crawling mode that uses raw HTTP requests to fetch pages. It's faster and cheaper, but might not work on all websites.
+ - _Headless web browser_ - Useful for websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions.
+ - _Stealthy web browser (default)_ - Another headless web browser with anti-blocking measures enabled. Try this if you encounter anti-bot protections while scraping.
+ - _Raw HTTP client_ - High-performance crawling mode that uses raw HTTP requests to fetch pages. It's faster and cheaper, but might not work on all websites.
##### Universal API call
@@ -160,10 +161,10 @@ Users access Apify through your platform without needing an Apify account. Apify
To help Apify monitor and support your integration, every API request should identify your platform. You can do this in one of two ways:
- Preferred:
- - Use the `x-apify-integration-platform` header with your platform name (e.g., make.com, zapier).
- - If your platform has multiple Apify apps, also include the `x-apify-integration-app-id` header with the unique app ID.
+ - Use the `x-apify-integration-platform` header with your platform name (e.g., make.com, zapier).
+ - If your platform has multiple Apify apps, also include the `x-apify-integration-app-id` header with the unique app ID.
- Alternative:
- - Set a custom `User-Agent` header that identifies your platform.
+ - Set a custom `User-Agent` header that identifies your platform.
These identifiers enable better analytics and support for your integration.
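For illustration, the identification headers sent alongside your usual token authentication might look like the following sketch (the platform name and app ID below are placeholders, not registered values):

```json
{
  "Authorization": "Bearer <your-apify-api-token>",
  "x-apify-integration-platform": "example-platform.com",
  "x-apify-integration-app-id": "example-platform-app-1"
}
```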
@@ -175,28 +176,28 @@ These identifiers enable better analytics and support for your integration.
- [Apify API Reference](https://docs.apify.com/api/v2)
- Client libraries
- - [JavaScript/TypeScript/Node.js](https://docs.apify.com/api/client/js/)
- - [Python](https://docs.apify.com/api/client/python/)
+ - [JavaScript/TypeScript/Node.js](https://docs.apify.com/api/client/js/)
+ - [Python](https://docs.apify.com/api/client/python/)
### Reference implementations
For inspiration, check out the public repositories of Apify's existing external integrations:
- Zapier
- - [Zapier integration documentation](https://docs.apify.com/platform/integrations/zapier)
- - [Source code on Github](https://github.com/apify/apify-zapier-integration)
+ - [Zapier integration documentation](https://docs.apify.com/platform/integrations/zapier)
+  - [Source code on GitHub](https://github.com/apify/apify-zapier-integration)
- Make.com
- - [Make.com integration documentation](https://docs.apify.com/platform/integrations/make)
+ - [Make.com integration documentation](https://docs.apify.com/platform/integrations/make)
- Kestra
- - [Kestra integration documentation](https://kestra.io/plugins/plugin-apify)
- - [Source code on Github](https://github.com/kestra-io/plugin-apify)
+ - [Kestra integration documentation](https://kestra.io/plugins/plugin-apify)
+  - [Source code on GitHub](https://github.com/kestra-io/plugin-apify)
- Keboola
- - [Keboola integration documentation](https://docs.apify.com/platform/integrations/keboola)
- - [Source code on GitHub](https://github.com/apify/keboola-ex-apify/) (JavaScript)
- - [Google Maps Reviews Scraper integration](https://github.com/apify/keboola-gmrs/) (Actor-specific)
+ - [Keboola integration documentation](https://docs.apify.com/platform/integrations/keboola)
+ - [Source code on GitHub](https://github.com/apify/keboola-ex-apify/) (JavaScript)
+ - [Google Maps Reviews Scraper integration](https://github.com/apify/keboola-gmrs/) (Actor-specific)
- Airbyte
- - [Source code on GitHub](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-apify-dataset) (Python)
+ - [Source code on GitHub](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-apify-dataset) (Python)
- Pipedream
- - [Source code on GitHub](https://github.com/PipedreamHQ/pipedream/tree/65e79d1d66cf0f2fca5ad20a18acd001f5eea069/components/apify)
+ - [Source code on GitHub](https://github.com/PipedreamHQ/pipedream/tree/65e79d1d66cf0f2fca5ad20a18acd001f5eea069/components/apify)
For technical support, please contact us at [integrations@apify.com](mailto:integrations@apify.com).
diff --git a/sources/platform/integrations/programming/api.md b/sources/platform/integrations/programming/api.md
index 6923b2d64f..8295bf3f7f 100644
--- a/sources/platform/integrations/programming/api.md
+++ b/sources/platform/integrations/programming/api.md
@@ -26,7 +26,7 @@ To access the Apify API in your integrations, you need to authenticate using you
:::caution
Do not share the API token with untrusted parties, or use it directly from client-side code,
-unless you fully understand the consequences! You can also consider [limiting the permission scope](#limited-permissions) of the token, so that it can only access what it really needs.
+unless you fully understand the consequences! You can also consider [limiting the permission scope](#limited-permissions) of the token, so that it can only access what it really needs.
:::
## Authentication
@@ -52,7 +52,6 @@ For better security awareness, the UI marks tokens identified as compromised, ma

-
## Organization accounts
When working under an organization account, you will see two types of API tokens on the Integrations page.
@@ -179,7 +178,6 @@ If the toggle is **off**, the token can still trigger and inspect runs, but acce
- For accounts with **Restricted general resource access**, the token cannot read or write to default storages. [Learn more about restricted general resource access](/platform/collaboration/general-resource-access).
- For accounts with **Unrestricted general resource access**, the default storages can still be read anonymously using their IDs, but writing is prevented.
-
:::tip
Let's say your Actor produces a lot of data that you want to delete just after the Actor finishes. If you enable this toggle, your scoped token will be allowed to do that.
:::
diff --git a/sources/platform/integrations/programming/webhooks/actions.md b/sources/platform/integrations/programming/webhooks/actions.md
index c2f8398fb1..51c6a8b262 100644
--- a/sources/platform/integrations/programming/webhooks/actions.md
+++ b/sources/platform/integrations/programming/webhooks/actions.md
@@ -5,7 +5,7 @@ sidebar_position: 2
slug: /integrations/webhooks/actions
---
-**Send notifications when specific events occur in your Actor/task run or build. Dynamically add data to the notification payload.**
+**Send notifications when specific events occur in your Actor/task run or build. Dynamically add data to the notification payload.**
---
@@ -92,22 +92,22 @@ The syntax of a variable is: `{{oneOfAvailableVariables}}`. Variables support ac
```json5
{
- "userId": "abf6vtB2nvQZ4nJzo",
- "createdAt": "2019-01-09T15:59:56.408Z",
- "eventType": "ACTOR.RUN.SUCCEEDED",
- "eventData": {
- "actorId": "fW4MyDhgwtMLrB987",
- "actorRunId": "uPBN9qaKd2iLs5naZ"
+ userId: 'abf6vtB2nvQZ4nJzo',
+ createdAt: '2019-01-09T15:59:56.408Z',
+ eventType: 'ACTOR.RUN.SUCCEEDED',
+ eventData: {
+ actorId: 'fW4MyDhgwtMLrB987',
+ actorRunId: 'uPBN9qaKd2iLs5naZ',
},
- "resource": {
- "id": "uPBN9qaKd2iLs5naZ",
- "actId": "fW4MyDhgwtMLrB987",
- "userId": "abf6vtB2nvQZ4nJzo",
- "startedAt": "2019-01-09T15:59:40.750Z",
- "finishedAt": "2019-01-09T15:59:56.408Z",
- "status": "SUCCEEDED",
+ resource: {
+ id: 'uPBN9qaKd2iLs5naZ',
+ actId: 'fW4MyDhgwtMLrB987',
+ userId: 'abf6vtB2nvQZ4nJzo',
+ startedAt: '2019-01-09T15:59:40.750Z',
+ finishedAt: '2019-01-09T15:59:56.408Z',
+ status: 'SUCCEEDED',
// ...
- }
+ },
}
```
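
As an illustrative sketch (not a required format), a custom payload template could forward just a few of these fields by referencing them with variables:

```json5
{
  "runId": {{resource.id}},
  "runStatus": {{resource.status}},
  "finishedAt": {{resource.finishedAt}},
  "event": {{eventType}}
}
```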
@@ -156,13 +156,13 @@ The headers template is a JSON-like text where you can add additional informatio
Note that the following HTTP headers are always set by the system and your changes will always be rewritten:
-| Variable | Value |
-|---------------------------|-------------------------|
-| `Host` | Request URL |
-| `Content-Type` | `application/json` |
-| `X-Apify-Webhook` | Apify internal value |
-| `X-Apify-Webhook-Dispatch-Id` | Apify webhook dispatch ID |
-| `X-Apify-Request-Origin` | Apify origin |
+| Variable | Value |
+| ----------------------------- | ------------------------- |
+| `Host` | Request URL |
+| `Content-Type` | `application/json` |
+| `X-Apify-Webhook` | Apify internal value |
+| `X-Apify-Webhook-Dispatch-Id` | Apify webhook dispatch ID |
+| `X-Apify-Request-Origin` | Apify origin |
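
For example, a headers template that adds a couple of custom headers might look like the following sketch (the header names and values are placeholders; the system-set headers above will still be overwritten):

```json
{
  "X-My-Platform": "example-platform",
  "X-Environment": "production"
}
```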
## Description
@@ -170,13 +170,13 @@ The description is an optional string that you can add to the webhook. It serves
## Available variables
-| Variable | Type | Description |
-|-------------|--------|-------------------------------------------------------------------------------------|
-| `userId` | string | ID of the Apify user who owns the webhook. |
-| `createdAt` | string | ISO string date of the webhook's trigger event. |
-| `eventType` | string | Type of the trigger event, see [Events](/platform/integrations/webhooks/events). |
-| `eventData` | Object | Data associated with the trigger event, see [Events](/platform/integrations/webhooks/events). |
-| `resource` | Object | The resource that caused the trigger event. |
+| Variable | Type | Description |
+| ----------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `userId` | string | ID of the Apify user who owns the webhook. |
+| `createdAt` | string | ISO string date of the webhook's trigger event. |
+| `eventType` | string | Type of the trigger event, see [Events](/platform/integrations/webhooks/events). |
+| `eventData` | Object | Data associated with the trigger event, see [Events](/platform/integrations/webhooks/events). |
+| `resource` | Object | The resource that caused the trigger event. |
| `globals` | Object | Data available in global context. Contains `dateISO` (date of webhook's trigger event in ISO 8601 format) and `dateUnix` (date of trigger event in Unix time in seconds) |
### Resource
diff --git a/sources/platform/integrations/programming/webhooks/events.md b/sources/platform/integrations/programming/webhooks/events.md
index f6050fec59..138b0a7e46 100644
--- a/sources/platform/integrations/programming/webhooks/events.md
+++ b/sources/platform/integrations/programming/webhooks/events.md
@@ -5,8 +5,7 @@ sidebar_position: 1
slug: /integrations/webhooks/events
---
-
-**Specify the types of events that trigger a webhook in an Actor or task run. Trigger an action on Actor or task run creation, success, failure, termination or timeout.**
+**Specify the types of events that trigger a webhook in an Actor or task run. Trigger an action on Actor or task run creation, success, failure, termination or timeout.**
---
@@ -18,12 +17,12 @@ Actor run events are triggered when an Actor run is created or transitions to a
### Event types
-* `ACTOR.RUN.CREATED` - A new Actor run has been created.
-* `ACTOR.RUN.SUCCEEDED` - An Actor run finished with status `SUCCEEDED`.
-* `ACTOR.RUN.FAILED` - An Actor run finished with status `FAILED`.
-* `ACTOR.RUN.ABORTED` - An Actor run finished with status `ABORTED`.
-* `ACTOR.RUN.TIMED_OUT` - An Actor run finished with status `TIMED-OUT`.
-* `ACTOR.RUN.RESURRECTED` - An Actor run has been resurrected.
+- `ACTOR.RUN.CREATED` - A new Actor run has been created.
+- `ACTOR.RUN.SUCCEEDED` - An Actor run finished with status `SUCCEEDED`.
+- `ACTOR.RUN.FAILED` - An Actor run finished with status `FAILED`.
+- `ACTOR.RUN.ABORTED` - An Actor run finished with status `ABORTED`.
+- `ACTOR.RUN.TIMED_OUT` - An Actor run finished with status `TIMED-OUT`.
+- `ACTOR.RUN.RESURRECTED` - An Actor run has been resurrected.
### Event data
@@ -31,9 +30,9 @@ The following data is provided for Actor run events:
```json5
{
- "actorId": "ID of the triggering Actor.",
- "actorTaskId": "If task was used, its ID.",
- "actorRunId": "ID of the triggering Actor run.",
+ actorId: 'ID of the triggering Actor.',
+ actorTaskId: 'If task was used, its ID.',
+ actorRunId: 'ID of the triggering Actor run.',
}
```
@@ -51,11 +50,11 @@ Actor build events are triggered when an Actor build is created or transitions i
### Event types
-* `ACTOR.BUILD.CREATED` - A new Actor build has been created.
-* `ACTOR.BUILD.SUCCEEDED` - An Actor build finished with the status `SUCCEEDED`.
-* `ACTOR.BUILD.FAILED` - An Actor build finished with the status `FAILED`.
-* `ACTOR.BUILD.ABORTED` - An Actor build finished with the status `ABORTED`.
-* `ACTOR.BUILD.TIMED_OUT` - An Actor build finished with the status `TIMED-OUT`.
+- `ACTOR.BUILD.CREATED` - A new Actor build has been created.
+- `ACTOR.BUILD.SUCCEEDED` - An Actor build finished with the status `SUCCEEDED`.
+- `ACTOR.BUILD.FAILED` - An Actor build finished with the status `FAILED`.
+- `ACTOR.BUILD.ABORTED` - An Actor build finished with the status `ABORTED`.
+- `ACTOR.BUILD.TIMED_OUT` - An Actor build finished with the status `TIMED-OUT`.
### Event Data
@@ -63,7 +62,7 @@ The following data is provided for Actor build events:
```json5
{
- "actorId": "ID of the triggering Actor.",
- "actorBuildId": "ID of the triggering Actor build.",
+ actorId: 'ID of the triggering Actor.',
+ actorBuildId: 'ID of the triggering Actor build.',
}
```
diff --git a/sources/platform/integrations/programming/webhooks/index.md b/sources/platform/integrations/programming/webhooks/index.md
index c9d39efc04..8b0ee1670a 100644
--- a/sources/platform/integrations/programming/webhooks/index.md
+++ b/sources/platform/integrations/programming/webhooks/index.md
@@ -20,10 +20,10 @@ To define a webhook, select a system **event** that triggers the webhook. Then,
:::info Current webhook limitations
- Currently, the only available action is to send a POST HTTP request to a URL specified in the webhook.
+Currently, the only available action is to send a POST HTTP request to a URL specified in the webhook.
:::
-* [**Events**](/platform/integrations/webhooks/events)
-* [**Actions**](/platform/integrations/webhooks/actions)
-* [**Ad-hoc webhooks**](/platform/integrations/webhooks/ad-hoc-webhooks)
+- [**Events**](/platform/integrations/webhooks/events)
+- [**Actions**](/platform/integrations/webhooks/actions)
+- [**Ad-hoc webhooks**](/platform/integrations/webhooks/ad-hoc-webhooks)
diff --git a/sources/platform/integrations/workflows-and-notifications/bubble.md b/sources/platform/integrations/workflows-and-notifications/bubble.md
index 33292e9049..d902893e1a 100644
--- a/sources/platform/integrations/workflows-and-notifications/bubble.md
+++ b/sources/platform/integrations/workflows-and-notifications/bubble.md
@@ -9,6 +9,7 @@ slug: /integrations/bubble
**Learn how to integrate your Apify Actors with Bubble for automated workflows and notifications.**
---
+
[Bubble](https://bubble.io/) is a no-code platform that allows you to build web applications without writing code. With the [Apify integration for Bubble](https://bubble.io/plugin/apify-1749639212621x698168698147962900), you can easily connect your Apify Actors to your Bubble applications to automate workflows and display scraped data.
:::tip Explore the live demo
@@ -60,8 +61,7 @@ For security, avoid hardcoding the token in action settings. Store it on the `Us
When configuring Apify actions in a workflow (check out screenshot below), set the token field dynamically to:
- `Current User's apify_api_token`
- - 
-
+ - 
## Using the integration
@@ -72,14 +72,14 @@ Once the plugin is configured, you can start building automated workflows.
Apify's Bubble plugin exposes two ways to interact with Apify:
- **Actions (workflow steps)**: Executed inside a Bubble workflow (both page workflows and backend workflows). Use these to trigger side effects like running an Actor or Task, or creating a webhook. They run during the workflow execution and can optionally wait for the result (if timeout is greater than 0).
- - Examples: **Run Actor**, **Run Actor Task**, **Create Webhook**, **Delete Webhook**.
- - Location in Bubble: **Workflow editor → Add an action → Plugins → Apify**
- - 
+ - Examples: **Run Actor**, **Run Actor Task**, **Create Webhook**, **Delete Webhook**.
+ - Location in Bubble: **Workflow editor → Add an action → Plugins → Apify**
+ - 
- **Data calls (data sources)**: Used as data sources in element properties and expressions. They fetch data from Apify and return it as lists/objects that you can bind to UI (for example, a repeating group) or use inside expressions.
- - Examples: **Fetch Data From Dataset JSON As Data**, **List Actor Runs**, **Get Record As Text/Image/File** from key-value store, **List User Datasets/Actors/Tasks**.
- - Location in Bubble: In any property input where a data source is expected click **Insert dynamic data**, under **Data sources** select **Get Data from an External API**, and choose the desired Apify data call.
- - 
+ - Examples: **Fetch Data From Dataset JSON As Data**, **List Actor Runs**, **Get Record As Text/Image/File** from key-value store, **List User Datasets/Actors/Tasks**.
+ - Location in Bubble: In any property input where a data source is expected click **Insert dynamic data**, under **Data sources** select **Get Data from an External API**, and choose the desired Apify data call.
+ - 
:::tip Inline documentation
@@ -87,17 +87,16 @@ Each Apify plugin action and data call input in Bubble includes inline documenta
:::
-
### Dynamic values in inputs and data calls
Dynamic values are available across Apify plugin fields. Use Bubble's **Insert dynamic data** to bind values from your app.
- For instance, you can source values from:
- - **Page/UI elements**: inputs, dropdowns, multi-selects, radio buttons, checkboxes
- - **Database Things and fields**
- - **Current User**
- - **Previous workflow steps** (e.g., Step 2's Run Actor result's `defaultDatasetId` or `runId`)
- - **Get Data from an External API**: data calls
+ - **Page/UI elements**: inputs, dropdowns, multi-selects, radio buttons, checkboxes
+ - **Database Things and fields**
+ - **Current User**
+ - **Previous workflow steps** (e.g., Step 2's Run Actor result's `defaultDatasetId` or `runId`)
+ - **Get Data from an External API**: data calls
#### Examples
@@ -105,7 +104,7 @@ Dynamic values are available across Apify plugin fields. Use Bubble's **Insert d
```json
{
- "url": "Input URL's value"
+ "url": "Input URL's value"
}
```
@@ -115,7 +114,6 @@ When inserting dynamic data, Bubble replaces the selected text. Place your curso
:::
-
## Run Apify plugin actions from Bubble events
Create workflows that run Apify plugin actions in response to events in your Bubble app, such as button clicks or form submissions.
@@ -141,22 +139,21 @@ Create workflows that run Apify plugin actions in response to events in your Bub
Find IDs directly in Apify Console. Each resource page shows the ID in the API panel and in the page URL.
- **Actor ID**: Actor detail page → API panel or URL.
- - Example URL: `https://console.apify.com/actors/`
- - Actor name format: owner/name (e.g., `apify/website-scraper`)
+ - Example URL: `https://console.apify.com/actors/`
+ - Actor name format: owner/name (e.g., `apify/website-scraper`)
- **Task ID**: Task detail page → API panel or URL.
- - Example URL: `https://console.apify.com/actors/tasks/`
+ - Example URL: `https://console.apify.com/actors/tasks/`
- **Dataset ID**: Storage → Datasets → Dataset detail → API panel or URL.
- - Example URL: `https://console.apify.com/storage/datasets/`
- - Also available in the table in `Storage → Datasets` page
+ - Example URL: `https://console.apify.com/storage/datasets/`
+ - Also available in the table in `Storage → Datasets` page
- **Key-value store ID**: Storage → Key-value stores → Store detail → API panel or URL.
- - Example URL: `https://console.apify.com/storage/key-value-stores/`
- - Also available in the table in `Storage → Key-value stores` page
+ - Example URL: `https://console.apify.com/storage/key-value-stores/`
+ - Also available in the table in `Storage → Key-value stores` page
- **Webhook ID**: Actors → Actor → Integrations.
- - Example URL: `https://console.apify.com/actors//integrations/`
+ - Example URL: `https://console.apify.com/actors//integrations/`
You can also discover IDs via the plugin responses and data calls (e.g., **List User Datasets**, **List Actor Runs**), which return objects with `id` fields you can pass into other actions/data calls.
-
## Display Apify data in your application
Populate elements in your Bubble application with information from your Apify account or Actor run data.
@@ -174,12 +171,12 @@ There are two common approaches:
- This example lists the current user's datasets and displays them in a repeating group.
- Add a **Repeating group** to the page.
- 1. Add data to a variable: create a custom state (for example, on the page) that will hold the list of datasets, and set it to the plugin's **List User Datasets** data call.
- - 
- 1. Set the type: in the repeating group's settings, set **Type of content** to match the dataset object your variable returns.
- - 
- 1. Bind the variable: set the repeating group's **Data source** to the variable from Step 1.
- - 
+ 1. Add data to a variable: create a custom state (for example, on the page) that will hold the list of datasets, and set it to the plugin's **List User Datasets** data call.
+ - 
+ 1. Set the type: in the repeating group's settings, set **Type of content** to match the dataset object your variable returns.
+ - 
+ 1. Bind the variable: set the repeating group's **Data source** to the variable from Step 1.
+ - 
- Inside the repeating group cell, bind dataset fields (for example, `Current cell's item name`, `id`, `createdAt`).
- 
@@ -294,5 +291,4 @@ Bubble workflows have execution time limits. For long‑running Actors, set the
Check that your JSON input is valid when providing **Input overrides** and that dynamic expressions resolve to valid JSON values. Verify the structure of the dataset output when displaying it in your app.
-
If you have any questions or need help, feel free to reach out to us on our [developer community on Discord](https://discord.com/invite/jyEM2PRvMU).
diff --git a/sources/platform/integrations/workflows-and-notifications/gmail.md b/sources/platform/integrations/workflows-and-notifications/gmail.md
index 866960f930..9fb0e4d4b7 100644
--- a/sources/platform/integrations/workflows-and-notifications/gmail.md
+++ b/sources/platform/integrations/workflows-and-notifications/gmail.md
@@ -29,7 +29,7 @@ To use the Apify integration for Gmail, you will need:

1. Set up the integration details. **Subject** and **Body** fields can make use of available variables. Dataset can be attached in several formats.
- By default, the integration is triggered by successful runs only.
+ By default, the integration is triggered by successful runs only.

@@ -40,4 +40,3 @@ Once this is done, run your Actor to test whether the integration is working.
You can manage your connected accounts at **[Settings > API & Integrations](https://console.apify.com/settings/integrations)**.

-
diff --git a/sources/platform/integrations/workflows-and-notifications/gumloop/index.md b/sources/platform/integrations/workflows-and-notifications/gumloop/index.md
index 923029920c..d2bd121c62 100644
--- a/sources/platform/integrations/workflows-and-notifications/gumloop/index.md
+++ b/sources/platform/integrations/workflows-and-notifications/gumloop/index.md
@@ -32,11 +32,11 @@ Retrieving data from Apify Actors is included in your Gumloop subscription. Apif
Each tool has a corresponding Gumloop credit cost. Each Gumloop subscription comes with a set of credits.
-| Sample prompt | Tool | Credit cost per use |
-| :--- | :--- | :--- |
-| Retrieve profile details for an Instagram user | Get Profile Details | 5 credits/profile |
-| Get videos for a specific hashtag | Get Hashtag Videos | 3 credits/video |
-| Show 5 most recent reviews for a restaurant | Get Place Reviews | 3 credits/review |
+| Sample prompt | Tool | Credit cost per use |
+| :--------------------------------------------- | :------------------ | :------------------ |
+| Retrieve profile details for an Instagram user | Get Profile Details | 5 credits/profile |
+| Get videos for a specific hashtag | Get Hashtag Videos | 3 credits/video |
+| Show 5 most recent reviews for a restaurant | Get Place Reviews | 3 credits/review |
## General integration (Apify Task Runner)
@@ -62,13 +62,13 @@ To use the Apify integration in Gumloop, you need an Apify account, a Gumloop ac
1. _Add Apify Task Runner node to your workflow_
- Open a new Gumloop pipeline page. Search for **Apify Task Runner** in the **Node Library**, and drag and drop the node onto your canvas.
+ Open a new Gumloop pipeline page. Search for **Apify Task Runner** in the **Node Library**, and drag and drop the node onto your canvas.

1. _Create and save tasks in Apify_
- The Apify Task Runner node fetches tasks from your saved tasks in Apify Console. To create a task, navigate to [**Actors**](https://console.apify.com/actors), click on the Actor you want to use, and then click **Create a task** next to the Run button. Configure your task settings and save.
+ The Apify Task Runner node fetches tasks from your saved tasks in Apify Console. To create a task, navigate to [**Actors**](https://console.apify.com/actors), click on the Actor you want to use, and then click **Create a task** next to the Run button. Configure your task settings and save.

diff --git a/sources/platform/integrations/workflows-and-notifications/gumloop/instagram.md b/sources/platform/integrations/workflows-and-notifications/gumloop/instagram.md
index d8ddad957d..3ec32c9783 100644
--- a/sources/platform/integrations/workflows-and-notifications/gumloop/instagram.md
+++ b/sources/platform/integrations/workflows-and-notifications/gumloop/instagram.md
@@ -18,16 +18,16 @@ Using the Gumloop Instagram MCP node, you can prompt the Instagram data you need
You can pull the following types of data from public Instagram accounts using Gumloop’s Instagram node (via Apify). Each action has a credit cost.
-| Tool/Action | Description | Credit Cost |
-| :---- | :---- | :---- |
-| Get profile posts | Fetch posts from a public Instagram profile, including captions, images, like and comment counts, and metadata. | 3 credits per item |
-| Get post comments | Retrieve all comments on a specific post, with author info, timestamps, and like counts. | 3 credits per item |
-| Get hashtag posts | Search by hashtag and return matching posts with full details. | 3 credits per item |
-| Find users | Look up Instagram users by name or handle and return profile metadata like bio, follower/following counts, etc. | 3 credits per item |
-| Get profile details | Extract detailed metadata from a profile, including follower count, bio, and verification status. | 5 credits per item |
-| Get profile stories | Get media URLs, timestamps, and view counts from an Instagram profile’s stories. | 3 credits per item |
-| Get profile reels | Fetch reels with captions, engagement metrics, play counts, and music info. | 3 credits per item |
-| Get tagged posts | Return posts where a specific user is tagged, with full post details. | 3 credits per item |
+| Tool/Action | Description | Credit Cost |
+| :------------------ | :-------------------------------------------------------------------------------------------------------------- | :----------------- |
+| Get profile posts | Fetch posts from a public Instagram profile, including captions, images, like and comment counts, and metadata. | 3 credits per item |
+| Get post comments | Retrieve all comments on a specific post, with author info, timestamps, and like counts. | 3 credits per item |
+| Get hashtag posts | Search by hashtag and return matching posts with full details. | 3 credits per item |
+| Find users | Look up Instagram users by name or handle and return profile metadata like bio, follower/following counts, etc. | 3 credits per item |
+| Get profile details | Extract detailed metadata from a profile, including follower count, bio, and verification status. | 5 credits per item |
+| Get profile stories | Get media URLs, timestamps, and view counts from an Instagram profile’s stories. | 3 credits per item |
+| Get profile reels | Fetch reels with captions, engagement metrics, play counts, and music info. | 3 credits per item |
+| Get tagged posts | Return posts where a specific user is tagged, with full post details. | 3 credits per item |
## Retrieve Instagram data in Gumloop
@@ -44,7 +44,6 @@ You can pull the following types of data from public Instagram accounts using Gu

:::tip Prompting tips
-
- MCP nodes only have access to the tools listed so your prompt should be scoped to Instagram.
- You can mix and match different tools (get 10 latest videos for a hashtag and retrieve profile data for each post).
diff --git a/sources/platform/integrations/workflows-and-notifications/gumloop/maps.md b/sources/platform/integrations/workflows-and-notifications/gumloop/maps.md
index 8a77a1bf7b..ced4d0a325 100644
--- a/sources/platform/integrations/workflows-and-notifications/gumloop/maps.md
+++ b/sources/platform/integrations/workflows-and-notifications/gumloop/maps.md
@@ -18,14 +18,13 @@ Using the Gumloop Google Maps MCP node, you can simply prompt the location data
You can pull the following types of place data from Google Maps using Gumloop’s Google Maps node (via Apify). Each action has a credit cost.
-| Tool/Action | Description | Credit Cost |
-| :---- | :---- | :---- |
-| Search places | Search for places on Google Maps using location and search terms. | 3 credits per item |
-| Get place details | Retrieve detailed information about a specific place using its URL or place ID. | 5 credits per item |
-| Search by category | Search for places by a specific category (e.g. cafes, gyms) on Google Maps. | 3 credits per item |
-| Get place reviews | Fetch reviews for specific locations, including text, rating, and reviewer info. | 3 credits per item |
-| Find places in area | Return all visible places within a defined map area or bounding box. | 3 credits per item |
-
+| Tool/Action | Description | Credit Cost |
+| :------------------ | :------------------------------------------------------------------------------- | :----------------- |
+| Search places | Search for places on Google Maps using location and search terms. | 3 credits per item |
+| Get place details | Retrieve detailed information about a specific place using its URL or place ID. | 5 credits per item |
+| Search by category | Search for places by a specific category (e.g. cafes, gyms) on Google Maps. | 3 credits per item |
+| Get place reviews | Fetch reviews for specific locations, including text, rating, and reviewer info. | 3 credits per item |
+| Find places in area | Return all visible places within a defined map area or bounding box. | 3 credits per item |
## Retrieve Google Maps data in Gumloop
@@ -42,7 +41,6 @@ You can pull the following types of place data from Google Maps using Gumloop’

:::tip Prompting tips
-
- MCP nodes only have access to the tools listed so your prompt should be scoped to Google Maps.
- You can mix and match different tools (e.g., search for gyms in Vancouver → get place details → pull reviews).
diff --git a/sources/platform/integrations/workflows-and-notifications/gumloop/tiktok.md b/sources/platform/integrations/workflows-and-notifications/gumloop/tiktok.md
index 61e10bd92f..1ed6928f73 100644
--- a/sources/platform/integrations/workflows-and-notifications/gumloop/tiktok.md
+++ b/sources/platform/integrations/workflows-and-notifications/gumloop/tiktok.md
@@ -17,13 +17,13 @@ Using the Gumloop TikTok MCP node, you can simply prompt the TikTok data you nee
You can pull the following types of data from TikTok using Gumloop’s TikTok node (via Apify). Each action has a credit cost.
-| Tool/Action | Description | Credit Cost |
-| :---- | :---- | :---- |
-| Get hashtag videos | Fetch videos from TikTok hashtags with captions, engagement metrics, play counts, and author information. | 3 credits per item |
-| Get profile videos | Get videos from TikTok user profiles with video metadata, engagement stats, music info, and timestamps. | 3 credits per item |
-| Get profile followers | Retrieve followers or following lists from TikTok profiles, including usernames, follower counts, and bios. | 3 credits per item |
-| Get video details | Get comprehensive data on a specific TikTok video using its URL—includes engagement and video-level metrics. | 5 credits per item |
-| Search videos | Search TikTok for videos and users using queries. Returns video details and user profile info. | 3 credits per item |
+| Tool/Action | Description | Credit Cost |
+| :-------------------- | :----------------------------------------------------------------------------------------------------------- | :----------------- |
+| Get hashtag videos | Fetch videos from TikTok hashtags with captions, engagement metrics, play counts, and author information. | 3 credits per item |
+| Get profile videos | Get videos from TikTok user profiles with video metadata, engagement stats, music info, and timestamps. | 3 credits per item |
+| Get profile followers | Retrieve followers or following lists from TikTok profiles, including usernames, follower counts, and bios. | 3 credits per item |
+| Get video details | Get comprehensive data on a specific TikTok video using its URL—includes engagement and video-level metrics. | 5 credits per item |
+| Search videos | Search TikTok for videos and users using queries. Returns video details and user profile info. | 3 credits per item |
## Retrieve TikTok data in Gumloop
@@ -40,7 +40,6 @@ You can pull the following types of data from TikTok using Gumloop’s TikTok no

:::tip Prompting tips
-
- MCP nodes only have access to the tools listed so your prompt should be scoped to TikTok.
- You can mix and match different tools (e.g., search a hashtag → get profile videos → retrieve engagement data).
diff --git a/sources/platform/integrations/workflows-and-notifications/gumloop/youtube.md b/sources/platform/integrations/workflows-and-notifications/gumloop/youtube.md
index 7565c8747e..1a8f7c3b1a 100644
--- a/sources/platform/integrations/workflows-and-notifications/gumloop/youtube.md
+++ b/sources/platform/integrations/workflows-and-notifications/gumloop/youtube.md
@@ -18,13 +18,13 @@ Using the Gumloop YouTube MCP node, you can simply prompt the YouTube data you n
You can pull the following types of data from YouTube using Gumloop’s YouTube node (via Apify). Each action has a credit cost:
-| Tool/Action | Description | Credit Cost |
-| :---- | :---- | :---- |
-| Search videos | Search YouTube by keywords and get video results with filtering, metadata, and content info. | 3 credit per item |
-| Get video details | Retrieve detailed stats and content info for specific videos via URL or ID. | 4 credit per item |
-| Get channel videos | Get videos from a specific YouTube channel with full metadata and context. | 3 credit per item |
-| Get playlist videos | Fetch videos from a YouTube playlist with metadata and playlist details. | 3 credit per item |
-| Get channel details | Get channel metadata including subscriber count, total videos, description, and more. | 5 credit per item |
+| Tool/Action         | Description                                                                                  | Credit Cost        |
+| :------------------ | :------------------------------------------------------------------------------------------ | :----------------- |
+| Search videos       | Search YouTube by keywords and get video results with filtering, metadata, and content info. | 3 credits per item |
+| Get video details   | Retrieve detailed stats and content info for specific videos via URL or ID.                  | 4 credits per item |
+| Get channel videos  | Get videos from a specific YouTube channel with full metadata and context.                   | 3 credits per item |
+| Get playlist videos | Fetch videos from a YouTube playlist with metadata and playlist details.                     | 3 credits per item |
+| Get channel details | Get channel metadata including subscriber count, total videos, description, and more.        | 5 credits per item |
## Retrieve YouTube data in Gumloop
@@ -41,7 +41,6 @@ You can pull the following types of data from YouTube using Gumloop’s YouTube

:::tip Prompting tips
-
- MCP nodes only have access to the tools listed so your prompt should be scoped to YouTube.
- You can mix and match different tools (e.g., search for videos → get video details → extract channel info).
@@ -70,4 +69,3 @@ You can pull the following types of data from YouTube using Gumloop’s YouTube
- [TikTok](/platform/integrations/gumloop/tiktok)
- [Instagram](/platform/integrations/gumloop/instagram)
- [Google Maps](/platform/integrations/gumloop/maps)
-
diff --git a/sources/platform/integrations/workflows-and-notifications/ifttt.md b/sources/platform/integrations/workflows-and-notifications/ifttt.md
index 0c700d8565..b9032bc604 100644
--- a/sources/platform/integrations/workflows-and-notifications/ifttt.md
+++ b/sources/platform/integrations/workflows-and-notifications/ifttt.md
To create an Applet that starts when an Apify event occurs:
1. In the **If this** section, click **Add**.
1. Search for and select **Apify** in the service list.
- 
+
1. Select a trigger from the available options:
- **Actor Run Finished**: Triggers when a selected Actor run completes
- **Task Run Finished**: Triggers when a selected Actor task run completes
- 
+
1. Configure the trigger by selecting the specific Actor or task.
1. Click **Create trigger** to continue.
@@ -75,30 +75,30 @@ To use Apify as an action in your Applet:
- **Run Actor**: Starts an Actor run
- **Run Task**: Starts an Actor Task run
- 
+
1. Select the Actor or task you want to use from the dropdown menu.
- :::note
+:::note
- IFTTT displays up to 50 recent items in a dropdown. If your Actor or task isn't visible, try using it at least once via API or in the Apify Console to make it appear in the list.
+IFTTT displays up to 50 recent items in a dropdown. If your Actor or task isn't visible, try using it at least once via API or in the Apify Console to make it appear in the list.
- :::
+:::
- 
+
1. Configure the action parameters:
- | Parameter | Description | Example Values |
- |-----------|-------------|----------------|
- | **Wait until run finishes** | Defines how the Actor should be executed. | `yes`, `no` |
- | **Input overrides** | JSON input that overrides the Actor's default input. | `{"key": "value"}` |
- | **Build** | Specifies the Actor build to run. Can be a build tag or build number. See [Builds](/platform/actors/running/runs-and-builds#builds) for more information. | `0.2.10`, `version-0` |
- | **Memory** | Memory limit for the run in megabytes. See [Memory](/platform/actors/running/usage-and-resources#memory) for more information. | `256` |
+ | Parameter | Description | Example Values |
+ | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- |
+ | **Wait until run finishes** | Defines how the Actor should be executed. | `yes`, `no` |
+ | **Input overrides** | JSON input that overrides the Actor's default input. | `{"key": "value"}` |
+ | **Build** | Specifies the Actor build to run. Can be a build tag or build number. See [Builds](/platform/actors/running/runs-and-builds#builds) for more information. | `0.2.10`, `version-0` |
+ | **Memory** | Memory limit for the run in megabytes. See [Memory](/platform/actors/running/usage-and-resources#memory) for more information. | `256` |
1. Click **Create action** to finish setting up the action.
- 
+
1. Give your Applet a name and click **Finish** to save it.
@@ -109,7 +109,7 @@ To check if your Applet is working properly:
1. Go to your Applet's detail page.
1. Click the **View activity** button to see the execution history.
- 
+
## Available triggers, actions, and queries
diff --git a/sources/platform/integrations/workflows-and-notifications/make/ai-crawling.md b/sources/platform/integrations/workflows-and-notifications/make/ai-crawling.md
index d1a08ebcf6..1a9b4c32e0 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/ai-crawling.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/ai-crawling.md
@@ -76,22 +76,22 @@ For each crawled web page, you'll receive:
```json title="Sample output (shortened)"
{
- "url": "https://docs.apify.com/academy/scraping-basics-javascript",
- "crawl": {
- "loadedUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
- "loadedTime": "2025-04-22T14:33:20.514Z",
- "referrerUrl": "https://docs.apify.com/academy",
- "depth": 1,
- "httpStatusCode": 200
- },
- "metadata": {
- "canonicalUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
- "title": "Web scraping basics for JavaScript devs | Apify Documentation",
- "description": "Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics.",
- "languageCode": "en",
- "markdown": "# Web scraping basics for JavaScript devs\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
- "text": "Web scraping basics for JavaScript devs\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
- }
+ "url": "https://docs.apify.com/academy/scraping-basics-javascript",
+ "crawl": {
+ "loadedUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
+ "loadedTime": "2025-04-22T14:33:20.514Z",
+ "referrerUrl": "https://docs.apify.com/academy",
+ "depth": 1,
+ "httpStatusCode": 200
+ },
+ "metadata": {
+ "canonicalUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
+ "title": "Web scraping basics for JavaScript devs | Apify Documentation",
+ "description": "Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics.",
+ "languageCode": "en",
+ "markdown": "# Web scraping basics for JavaScript devs\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
+ "text": "Web scraping basics for JavaScript devs\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
+ }
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/make/amazon.md b/sources/platform/integrations/workflows-and-notifications/make/amazon.md
index 9743f62744..56c497ef75 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/amazon.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/amazon.md
@@ -39,7 +39,7 @@ Once connected, you can build workflows to automate Amazon data extraction and i
After connecting the app, you can use the Search module as a native scraper to extract public Amazon data. Here’s what you get:
-### Extract Amazon data
+### Extract Amazon data
Get data via [Apify's Amazon Scraper](https://apify.com/junglee/free-amazon-product-scraper). Fill in the URLs of products, searches, or categories you want to gather information about.
@@ -56,49 +56,49 @@ For Amazon URLs, you can extract:
```json title="Example"
[
{
- "title": "Logitech M185 Wireless Mouse, 2.4GHz with USB Mini Receiver, 12-Month Battery Life, 1000 DPI Optical Tracking, Ambidextrous PC/Mac/Laptop - Swift Grey",
- "asin": "B004YAVF8I",
- "brand": "Logitech",
- "stars": 4.5,
- "reviewsCount": 37418,
- "thumbnailImage": "https://m.media-amazon.com/images/I/5181UFuvoBL.__AC_SX300_SY300_QL70_FMwebp_.jpg",
- "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice",
- "description": "Logitech Wireless Mouse M185. A simple, reliable mouse with plug-and-play wireless, a 1-year battery life and 3-year limited hardware warranty.(Battery life may vary based on user and computing conditions.) System Requirements: Windows Vista Windows 7 Windows 8 Windows 10|Mac OS X 10.5 or later|Chrome OS|Linux kernel 2.6+|USB port",
- "price": {
- "value": 13.97,
- "currency": "$"
- },
- "url": "https://www.amazon.com/dp/B004YAVF8I"
+ "title": "Logitech M185 Wireless Mouse, 2.4GHz with USB Mini Receiver, 12-Month Battery Life, 1000 DPI Optical Tracking, Ambidextrous PC/Mac/Laptop - Swift Grey",
+ "asin": "B004YAVF8I",
+ "brand": "Logitech",
+ "stars": 4.5,
+ "reviewsCount": 37418,
+ "thumbnailImage": "https://m.media-amazon.com/images/I/5181UFuvoBL.__AC_SX300_SY300_QL70_FMwebp_.jpg",
+ "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice",
+ "description": "Logitech Wireless Mouse M185. A simple, reliable mouse with plug-and-play wireless, a 1-year battery life and 3-year limited hardware warranty.(Battery life may vary based on user and computing conditions.) System Requirements: Windows Vista Windows 7 Windows 8 Windows 10|Mac OS X 10.5 or later|Chrome OS|Linux kernel 2.6+|USB port",
+ "price": {
+ "value": 13.97,
+ "currency": "$"
+ },
+ "url": "https://www.amazon.com/dp/B004YAVF8I"
},
{
- "title": "Logitech MX Master 3S - Wireless Performance Mouse with Ultra-fast Scrolling, Ergo, 8K DPI, Track on Glass, Quiet Clicks, USB-C, Bluetooth, Windows, Linux, Chrome - Graphite",
- "asin": "B09HM94VDS",
- "brand": "Logitech",
- "stars": 4.5,
- "reviewsCount": 9333,
- "thumbnailImage": "https://m.media-amazon.com/images/I/41+eEANAv3L._AC_SY300_SX300_.jpg",
- "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice",
- "description": "Logitech MX Master 3S Performance Wireless Mouse Introducing Logitech MX Master 3S – an iconic mouse remastered. Now with Quiet Clicks(2) and 8K DPI any-surface tracking for more feel and performance than ever before. Product details: Weight: 4.97 oz (141 g) Dimensions: 2 x 3.3 x 4.9 in (51 x 84.3 x 124.9 mm) Compatible with Windows, macOS, Linux, Chrome OS, iPadOS, Android operating systems (8) Rechargeable Li-Po (500 mAh) battery Sensor technology: Darkfield high precision Buttons: 7 buttons (Left/Right-click, Back/Forward, App-Switch, Wheel mode-shift, Middle click), Scroll Wheel, Thumbwheel, Gesture button Wireless operating distance: 33 ft (10 m) (9)Footnotes: (1) 4 mm minimum glass thickness (2) Compared to MX Master 3, MX Master 3S has 90% less Sound Power Level left and right click, measured at 1m (3) Compared to regular Logitech mouse without an electromagnetic scroll wheel (4) Compared to Logitech Master 2S mouse with Logitech Options installed and Smooth scrolling enabled (5) Requires Logi Options+ software, available for Windows and macOS (6) Not compatible with Logitech Unifying technology (7) Battery life may vary based on user and computing conditions. (8) Device basic functions will be supported without software for operating systems other than Windows and macOS (9) Wireless range may vary depending on operating environment and computer setup",
- "price": {
- "value": 89.99,
- "currency": "$"
- },
- "url": "https://www.amazon.com/dp/B09HM94VDS"
+ "title": "Logitech MX Master 3S - Wireless Performance Mouse with Ultra-fast Scrolling, Ergo, 8K DPI, Track on Glass, Quiet Clicks, USB-C, Bluetooth, Windows, Linux, Chrome - Graphite",
+ "asin": "B09HM94VDS",
+ "brand": "Logitech",
+ "stars": 4.5,
+ "reviewsCount": 9333,
+ "thumbnailImage": "https://m.media-amazon.com/images/I/41+eEANAv3L._AC_SY300_SX300_.jpg",
+ "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice",
+ "description": "Logitech MX Master 3S Performance Wireless Mouse Introducing Logitech MX Master 3S – an iconic mouse remastered. Now with Quiet Clicks(2) and 8K DPI any-surface tracking for more feel and performance than ever before. Product details: Weight: 4.97 oz (141 g) Dimensions: 2 x 3.3 x 4.9 in (51 x 84.3 x 124.9 mm) Compatible with Windows, macOS, Linux, Chrome OS, iPadOS, Android operating systems (8) Rechargeable Li-Po (500 mAh) battery Sensor technology: Darkfield high precision Buttons: 7 buttons (Left/Right-click, Back/Forward, App-Switch, Wheel mode-shift, Middle click), Scroll Wheel, Thumbwheel, Gesture button Wireless operating distance: 33 ft (10 m) (9)Footnotes: (1) 4 mm minimum glass thickness (2) Compared to MX Master 3, MX Master 3S has 90% less Sound Power Level left and right click, measured at 1m (3) Compared to regular Logitech mouse without an electromagnetic scroll wheel (4) Compared to Logitech Master 2S mouse with Logitech Options installed and Smooth scrolling enabled (5) Requires Logi Options+ software, available for Windows and macOS (6) Not compatible with Logitech Unifying technology (7) Battery life may vary based on user and computing conditions. (8) Device basic functions will be supported without software for operating systems other than Windows and macOS (9) Wireless range may vary depending on operating environment and computer setup",
+ "price": {
+ "value": 89.99,
+ "currency": "$"
+ },
+ "url": "https://www.amazon.com/dp/B09HM94VDS"
},
{
- "title": "Apple Magic Mouse - White Multi-Touch Surface ",
- "asin": "B0DL72PK1P",
- "brand": "Apple",
- "stars": 4.6,
- "reviewsCount": 18594,
- "thumbnailImage": "",
- "breadCrumbs": "",
- "description": null,
- "price": {
- "value": 78.99,
- "currency": "$"
- },
- "url": "https://www.amazon.com/dp/B0DL72PK1P"
+ "title": "Apple Magic Mouse - White Multi-Touch Surface ",
+ "asin": "B0DL72PK1P",
+ "brand": "Apple",
+ "stars": 4.6,
+ "reviewsCount": 18594,
+ "thumbnailImage": "",
+ "breadCrumbs": "",
+ "description": null,
+ "price": {
+ "value": 78.99,
+ "currency": "$"
+ },
+ "url": "https://www.amazon.com/dp/B0DL72PK1P"
}
]
```
@@ -233,5 +233,4 @@ There are other native Make Apps powered by Apify. You can check out Apify Scrap
- [YouTube Data](/platform/integrations/make/youtube)
- [AI crawling](/platform/integrations/make/ai-crawling)
-
And more! Because you can access any of thousands of our scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify).
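
For a sense of what the **Extract Amazon data** module described above does behind the scenes, here is a minimal sketch that calls the same [Amazon Scraper](https://apify.com/junglee/free-amazon-product-scraper) directly through the Apify API with the `apify-client` Python package. The token placeholder and the `productUrls` input key are illustrative assumptions — check the Actor's input schema before running.

```python
# Minimal sketch: run the Amazon Scraper Actor directly via the Apify API
# and read the resulting dataset items (same fields as the sample above).
# Requires: pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# NOTE: the input key below is assumed for illustration; verify it against
# the Actor's input schema at https://apify.com/junglee/free-amazon-product-scraper
run_input = {
    "productUrls": [{"url": "https://www.amazon.com/dp/B004YAVF8I"}],
}

# Start the Actor run and wait for it to finish.
run = client.actor("junglee/free-amazon-product-scraper").call(run_input=run_input)

# Iterate over the items stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("title"), item.get("price"))
```

The Make module wraps this same run-and-fetch cycle for you, so no code is needed there; the sketch only shows where the JSON fields in the sample come from.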
diff --git a/sources/platform/integrations/workflows-and-notifications/make/facebook.md b/sources/platform/integrations/workflows-and-notifications/make/facebook.md
index a5805f673c..2d09429488 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/facebook.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/facebook.md
@@ -61,39 +61,39 @@ For each given Facebook group URL, you will extract:
```json title="Profile data, shortened sample"
[
- {
- "facebookUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds",
- "url": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/",
- "time": "2025-04-09T15:34:31.000Z",
- "user": {
- "name": "Author name"
- },
- "text": "4/9/2025 - This glass fish was found at a friend's yard sale and for some reason it had to come home with me. Any ideas on how to display it?",
- "reactionLikeCount": 704,
- "reactionLoveCount": 185,
- "reactionWowCount": 10,
- "reactionCareCount": 6,
- "reactionHahaCount": 3,
- "attachments": [
- {
- "url": "https://www.facebook.com/media/set/?set=pcb.3348022435381946&type=1",
- "thumbnail": "https://scontent.fcgh33-1.fna.fbcdn.net/v/t39.30808-6/490077910_10228674979643758_5977579619381197326_n.jpg?stp=dst-jpg_s600x600_tt6"
- }
- ],
- "likesCount": 908,
- "sharesCount": 3,
- "commentsCount": 852,
- "topComments": [
- {
- "commentUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/?comment_id=3348201365364053",
- "text": "Would this work okay? Water and floating candle?",
- "profileName": "Bonnie FireUrchin Lambourn",
- "likesCount": 2
- }
- ],
- "facebookId": "650812835102933",
- "groupTitle": "Weird (and Wonderful) Secondhand Finds That Just Need To Be Shared"
- }
+ {
+ "facebookUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds",
+ "url": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/",
+ "time": "2025-04-09T15:34:31.000Z",
+ "user": {
+ "name": "Author name"
+ },
+ "text": "4/9/2025 - This glass fish was found at a friend's yard sale and for some reason it had to come home with me. Any ideas on how to display it?",
+ "reactionLikeCount": 704,
+ "reactionLoveCount": 185,
+ "reactionWowCount": 10,
+ "reactionCareCount": 6,
+ "reactionHahaCount": 3,
+ "attachments": [
+ {
+ "url": "https://www.facebook.com/media/set/?set=pcb.3348022435381946&type=1",
+ "thumbnail": "https://scontent.fcgh33-1.fna.fbcdn.net/v/t39.30808-6/490077910_10228674979643758_5977579619381197326_n.jpg?stp=dst-jpg_s600x600_tt6"
+ }
+ ],
+ "likesCount": 908,
+ "sharesCount": 3,
+ "commentsCount": 852,
+ "topComments": [
+ {
+ "commentUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/?comment_id=3348201365364053",
+ "text": "Would this work okay? Water and floating candle?",
+ "profileName": "Bonnie FireUrchin Lambourn",
+ "likesCount": 2
+ }
+ ],
+ "facebookId": "650812835102933",
+ "groupTitle": "Weird (and Wonderful) Secondhand Finds That Just Need To Be Shared"
+ }
]
```
@@ -150,87 +150,87 @@ You’ll get:
- _Comments_: Number of comments on the post
- _Shares_: Number of times the post has been shared
- _Media info_:
- - _URLs_: Links to media files
- - _Type_: Whether it's an image or video
- - _Dimensions_: Size of the media
+ - _URLs_: Links to media files
+ - _Type_: Whether it's an image or video
+ - _Dimensions_: Size of the media
- _Owner info_:
- - _Username_: Account name of the post owner
- - _User ID_: Unique identifier for the owner
- - _Full name_: Full name of the account holder
+ - _Username_: Account name of the post owner
+ - _User ID_: Unique identifier for the owner
+ - _Full name_: Full name of the account holder
- _Tags_: Hashtags used in the post
- _Location_: Geographic location tagged in the post (if available)
```json title="Example (shortened)"
[
- {
- "facebookUrl": "https://www.facebook.com/nasa",
- "postId": "1215784396583601",
- "pageName": "NASA",
- "url": "https://www.facebook.com/NASA/posts/pfbid029aLb3sDGnXuYA5P7DK5uRT7Upf39X5fwCBFcRz9C3M4EMShwJWNwLLaXA5RdYeyKl",
- "time": "2025-04-07T19:09:00.000Z",
- "user": {
- "id": "100044561550831",
- "name": "NASA - National Aeronautics and Space Administration",
- "profileUrl": "https://www.facebook.com/NASA",
- "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
+ {
+ "facebookUrl": "https://www.facebook.com/nasa",
+ "postId": "1215784396583601",
+ "pageName": "NASA",
+ "url": "https://www.facebook.com/NASA/posts/pfbid029aLb3sDGnXuYA5P7DK5uRT7Upf39X5fwCBFcRz9C3M4EMShwJWNwLLaXA5RdYeyKl",
+ "time": "2025-04-07T19:09:00.000Z",
+ "user": {
+ "id": "100044561550831",
+ "name": "NASA - National Aeronautics and Space Administration",
+ "profileUrl": "https://www.facebook.com/NASA",
+ "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
+ },
+ "text": "It’s your time to shine! This Citizen Science Month, contribute to a NASA Citizen Science project that will help improve life on Earth and solve cosmic mysteries.",
+ "link": "https://science.nasa.gov/citizen-science/",
+ "likes": 2016,
+ "comments": 171,
+ "shares": 217,
+ "media": [
+ {
+ "thumbnail": "https://scontent.fbog3-3.fna.fbcdn.net/v/t39.30808-6/489419147_1215784366583604_2492050236576327908_n.jpg?stp=dst-jpg_s720x720_tt6&_nc_cat=110&ccb=1-7&_nc_sid=127cfc&_nc_ohc=YI6mnyIKJmwQ7kNvwGVLR7C&_nc_oc=AdklMZgJuQZ-r924q5F9ikY0F5E_LF2gbzNnepx75qTmtJ-jDnq6Ve-VkIQ1hcaCDhA"
+ }
+ ]
},
- "text": "It’s your time to shine! This Citizen Science Month, contribute to a NASA Citizen Science project that will help improve life on Earth and solve cosmic mysteries.",
- "link": "https://science.nasa.gov/citizen-science/",
- "likes": 2016,
- "comments": 171,
- "shares": 217,
- "media": [
- {
- "thumbnail": "https://scontent.fbog3-3.fna.fbcdn.net/v/t39.30808-6/489419147_1215784366583604_2492050236576327908_n.jpg?stp=dst-jpg_s720x720_tt6&_nc_cat=110&ccb=1-7&_nc_sid=127cfc&_nc_ohc=YI6mnyIKJmwQ7kNvwGVLR7C&_nc_oc=AdklMZgJuQZ-r924q5F9ikY0F5E_LF2gbzNnepx75qTmtJ-jDnq6Ve-VkIQ1hcaCDhA"
- }
- ]
- },
- {
- "facebookUrl": "https://www.facebook.com/nasa",
- "postId": "1215717559923618",
- "pageName": "NASA",
- "url": "https://www.facebook.com/NASA/posts/pfbid01SDwDikd344679WW4Er1F1UAB3cfpBH4Ud54RJEaTtD1Fih2xSzjtsCsYXgbh93Ll",
- "time": "2025-04-07T17:04:00.000Z",
- "user": {
- "id": "100044561550831",
- "name": "NASA - National Aeronautics and Space Administration",
- "profileUrl": "https://www.facebook.com/NASA",
- "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
+ {
+ "facebookUrl": "https://www.facebook.com/nasa",
+ "postId": "1215717559923618",
+ "pageName": "NASA",
+ "url": "https://www.facebook.com/NASA/posts/pfbid01SDwDikd344679WW4Er1F1UAB3cfpBH4Ud54RJEaTtD1Fih2xSzjtsCsYXgbh93Ll",
+ "time": "2025-04-07T17:04:00.000Z",
+ "user": {
+ "id": "100044561550831",
+ "name": "NASA - National Aeronautics and Space Administration",
+ "profileUrl": "https://www.facebook.com/NASA",
+ "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
+ },
+ "text": "NASA's Hubble Space Telescope has studied Uranus for more than 20 years and is still learning more about its gas.",
+ "link": "https://go.nasa.gov/3RIapAw",
+ "likes": 1878,
+ "comments": 144,
+ "shares": 215,
+ "media": [
+ {
+ "thumbnail": "https://scontent.fbog3-1.fna.fbcdn.net/v/t39.30808-6/489532065_1215717536590287_873488674466633974_n.jpg?stp=dst-jpg_p180x540_tt6&_nc_cat=109&ccb=1-7&_nc_sid=127cfc&_nc_ohc=kAiP3avgomkQ7kNvwGOb-YS&_nc_oc=Adn31Ca9oiQ5ieTtUtFqcr45R4jdJdVxei1kMR1kj-RLDehS-fyEVJD1fY2-5IItLe0"
+ }
+ ]
},
- "text": "NASA's Hubble Space Telescope has studied Uranus for more than 20 years and is still learning more about its gas.",
- "link": "https://go.nasa.gov/3RIapAw",
- "likes": 1878,
- "comments": 144,
- "shares": 215,
- "media": [
- {
- "thumbnail": "https://scontent.fbog3-1.fna.fbcdn.net/v/t39.30808-6/489532065_1215717536590287_873488674466633974_n.jpg?stp=dst-jpg_p180x540_tt6&_nc_cat=109&ccb=1-7&_nc_sid=127cfc&_nc_ohc=kAiP3avgomkQ7kNvwGOb-YS&_nc_oc=Adn31Ca9oiQ5ieTtUtFqcr45R4jdJdVxei1kMR1kj-RLDehS-fyEVJD1fY2-5IItLe0"
- }
- ]
- },
- {
- "facebookUrl": "https://www.facebook.com/nasa",
- "postId": "1212614090233965",
- "pageName": "NASA",
- "url": "https://www.facebook.com/NASA/videos/958890849561531/",
- "time": "2025-04-03T18:06:29.000Z",
- "user": {
- "id": "100044561550831",
- "name": "NASA - National Aeronautics and Space Administration",
- "profileUrl": "https://www.facebook.com/NASA",
- "profilePic": "https://scontent.fssz1-1.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
- },
- "text": "Rocket? Stacking. Crew training? Underway. Mission patch? Ready to go.",
- "link": "https://go.nasa.gov/41ZErWJ",
- "likes": 1813,
- "comments": 190,
- "shares": 456,
- "media": [
- {
- "thumbnail": "https://scontent.fssz1-1.fna.fbcdn.net/v/t15.5256-10/488073346_1027101039315356_6805938007276905855_n.jpg?_nc_cat=109&ccb=1-7&_nc_sid=7965db&_nc_ohc=M4hIzfAIbdAQ7kNvwFnbXVw&_nc_oc=AdmJODt8am5l58TuwIbYLbEMK_w9IFb6uaUqiq7SCtNI9ouf4Xd_nZcifKpRLWSsclg"
- }
- ]
- }
+ {
+ "facebookUrl": "https://www.facebook.com/nasa",
+ "postId": "1212614090233965",
+ "pageName": "NASA",
+ "url": "https://www.facebook.com/NASA/videos/958890849561531/",
+ "time": "2025-04-03T18:06:29.000Z",
+ "user": {
+ "id": "100044561550831",
+ "name": "NASA - National Aeronautics and Space Administration",
+ "profileUrl": "https://www.facebook.com/NASA",
+ "profilePic": "https://scontent.fssz1-1.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk"
+ },
+ "text": "Rocket? Stacking. Crew training? Underway. Mission patch? Ready to go.",
+ "link": "https://go.nasa.gov/41ZErWJ",
+ "likes": 1813,
+ "comments": 190,
+ "shares": 456,
+ "media": [
+ {
+ "thumbnail": "https://scontent.fssz1-1.fna.fbcdn.net/v/t15.5256-10/488073346_1027101039315356_6805938007276905855_n.jpg?_nc_cat=109&ccb=1-7&_nc_sid=7965db&_nc_ohc=M4hIzfAIbdAQ7kNvwFnbXVw&_nc_oc=AdmJODt8am5l58TuwIbYLbEMK_w9IFb6uaUqiq7SCtNI9ouf4Xd_nZcifKpRLWSsclg"
+ }
+ ]
+ }
]
```
@@ -246,4 +246,3 @@ Looking for more than just Facebook? You can use other native Make apps powered
- [Amazon](/platform/integrations/make/amazon)
And more! Because you can access any of thousands of our scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify).
-
diff --git a/sources/platform/integrations/workflows-and-notifications/make/instagram.md b/sources/platform/integrations/workflows-and-notifications/make/instagram.md
index 1c71f0d07f..6784688512 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/instagram.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/instagram.md
@@ -30,7 +30,6 @@ To use these modules, you need an [Apify account](https://console.apify.com) and
1. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**.
1. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface.
-

1. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module.
@@ -89,7 +88,7 @@ For each Instagram profile, you will extract:
### Extract Instagram comments
-Retrieve comments from posts by calling [Apify's Instagram Comments Scraper](https://apify.com/apify/instagram-comment-scraper). To set up this module, you will need to add Instagram posts or reels to extract the comments from, the desired number of comments, and optionally, the order of comments, and replies.
+Retrieve comments from posts by calling [Apify's Instagram Comments Scraper](https://apify.com/apify/instagram-comment-scraper). To set up this module, you will need to add the Instagram posts or reels to extract comments from, the desired number of comments, and, optionally, the order of comments and replies.
For each Instagram post, you will extract:
diff --git a/sources/platform/integrations/workflows-and-notifications/make/llm.md b/sources/platform/integrations/workflows-and-notifications/make/llm.md
index b806090cae..92678f85fb 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/llm.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/llm.md
@@ -46,15 +46,15 @@ Use Standard Settings to quickly search the web and extract content with optimiz
The module supports two modes:
- _Search mode_ (keywords)
- - Queries Google Search with your keywords (supports advanced operators)
- - Retrieves the top N organic results
- - Loads each result and extracts the main content
- - Returns Markdown-formatted content
+ - Queries Google Search with your keywords (supports advanced operators)
+ - Retrieves the top N organic results
+ - Loads each result and extracts the main content
+ - Returns Markdown-formatted content
- _Direct URL mode_ (URL)
- - Navigates to a specific URL
- - Extracts page content
- - Skips Google Search
+ - Navigates to a specific URL
+ - Extracts page content
+ - Skips Google Search
#### How it works
@@ -64,28 +64,28 @@ When you provide keywords, the module runs Google Search, parses the results, an
```json title="Standard Settings output (shortened)"
{
- "query": "web browser for RAG pipelines -site:reddit.com",
- "crawl": {
- "httpStatusCode": 200,
- "httpStatusMessage": "OK",
- "loadedAt": "2025-06-30T10:15:23.456Z",
- "uniqueKey": "https://example.com/article",
- "requestStatus": "handled"
- },
- "searchResult": {
- "title": "Building RAG Pipelines with Web Browsers",
- "description": "Integrate web browsing into your RAG pipeline for real-time retrieval.",
- "url": "https://example.com/article",
- "resultType": "organic",
- "rank": 1
- },
- "metadata": {
- "title": "Building RAG Pipelines with Web Browsers",
- "description": "Add web browsing to RAG systems",
- "languageCode": "en",
- "url": "https://example.com/article"
- },
- "markdown": "# Building RAG Pipelines with Web Browsers\n\n..."
+ "query": "web browser for RAG pipelines -site:reddit.com",
+ "crawl": {
+ "httpStatusCode": 200,
+ "httpStatusMessage": "OK",
+ "loadedAt": "2025-06-30T10:15:23.456Z",
+ "uniqueKey": "https://example.com/article",
+ "requestStatus": "handled"
+ },
+ "searchResult": {
+ "title": "Building RAG Pipelines with Web Browsers",
+ "description": "Integrate web browsing into your RAG pipeline for real-time retrieval.",
+ "url": "https://example.com/article",
+ "resultType": "organic",
+ "rank": 1
+ },
+ "metadata": {
+ "title": "Building RAG Pipelines with Web Browsers",
+ "description": "Add web browsing to RAG systems",
+ "languageCode": "en",
+ "url": "https://example.com/article"
+ },
+ "markdown": "# Building RAG Pipelines with Web Browsers\n\n..."
}
```
@@ -123,36 +123,36 @@ Advanced Settings give you full control over search and extraction. Use it for c
```json title="Advanced Settings output (shortened)"
{
- "query": "advanced RAG implementation strategies",
- "crawl": {
- "httpStatusCode": 200,
- "httpStatusMessage": "OK",
- "loadedUrl": "https://ai-research.com/rag-strategies",
- "loadedTime": "2025-06-30T10:45:12.789Z",
- "referrerUrl": "https://www.google.com/search?q=advanced+RAG+implementation+strategies",
- "uniqueKey": "https://ai-research.com/rag-strategies",
- "requestStatus": "handled",
- "depth": 0
- },
- "searchResult": {
- "title": "Advanced RAG Implementation: A Complete Guide",
- "description": "Cutting-edge strategies for RAG systems.",
- "url": "https://ai-research.com/rag-strategies",
- "resultType": "organic",
- "rank": 1
- },
- "metadata": {
- "canonicalUrl": "https://ai-research.com/rag-strategies",
- "title": "Advanced RAG Implementation: A Complete Guide | AI Research",
- "description": "Vector DBs, chunking, and optimization techniques.",
- "languageCode": "en"
- },
- "markdown": "# Advanced RAG Implementation: A Complete Guide\n\n...",
- "debug": {
- "extractorUsed": "readableText",
- "elementsRemoved": 47,
- "elementsClicked": 3
- }
+ "query": "advanced RAG implementation strategies",
+ "crawl": {
+ "httpStatusCode": 200,
+ "httpStatusMessage": "OK",
+ "loadedUrl": "https://ai-research.com/rag-strategies",
+ "loadedTime": "2025-06-30T10:45:12.789Z",
+ "referrerUrl": "https://www.google.com/search?q=advanced+RAG+implementation+strategies",
+ "uniqueKey": "https://ai-research.com/rag-strategies",
+ "requestStatus": "handled",
+ "depth": 0
+ },
+ "searchResult": {
+ "title": "Advanced RAG Implementation: A Complete Guide",
+ "description": "Cutting-edge strategies for RAG systems.",
+ "url": "https://ai-research.com/rag-strategies",
+ "resultType": "organic",
+ "rank": 1
+ },
+ "metadata": {
+ "canonicalUrl": "https://ai-research.com/rag-strategies",
+ "title": "Advanced RAG Implementation: A Complete Guide | AI Research",
+ "description": "Vector DBs, chunking, and optimization techniques.",
+ "languageCode": "en"
+ },
+ "markdown": "# Advanced RAG Implementation: A Complete Guide\n\n...",
+ "debug": {
+ "extractorUsed": "readableText",
+ "elementsRemoved": 47,
+ "elementsClicked": 3
+ }
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/make/maps.md b/sources/platform/integrations/workflows-and-notifications/make/maps.md
index bfcf0fa87f..820fa5b7b1 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/maps.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/maps.md
@@ -74,54 +74,54 @@ Categories can be general (e.g., "restaurant") which includes all variations lik
```json title="Business lead data, shortened sample"
{
- "searchString": "Restaurant in Staten Island",
- "rank": 3,
- "title": "Kim's Island",
- "placeId": "ChIJJaKM4pyKwokRCZ8XaBNj_Gw",
- "categoryName": "Chinese restaurant",
- "price": "$10–20",
- "rating": 4.6,
- "reviewsCount": 182,
- "featuredInLists": ["Best Chinese Food", "Top Rated Restaurants"],
-
- // Complete address information for targeted outreach
- "address": "175 Main St, Staten Island, NY 10307",
- "neighborhood": "Tottenville",
- "street": "175 Main St",
- "city": "Staten Island",
- "postalCode": "10307",
- "state": "New York",
- "countryCode": "US",
- "plusCode": "GQ62+8M Staten Island, New York",
-
- // Multiple contact channels
- "website": "http://kimsislandsi.com/",
- "phone": "(718) 356-5168",
- "phoneUnformatted": "+17183565168",
- "email": "info@kimsislandsi.com", // From website enrichment
-
- // Business qualification data
- "yearsInBusiness": 12,
- "claimThisBusiness": false, // Verified listing
- "popular": true,
- "temporarilyClosed": false,
-
- // Precise location for territory planning
- "location": {
- "lat": 40.5107736,
- "lng": -74.2482624
- },
-
- // Operational insights for scheduling outreach
- "openingHours": {
- "Monday": "11:00 AM - 10:00 PM",
- "Tuesday": "11:00 AM - 10:00 PM",
- "Wednesday": "11:00 AM - 10:00 PM",
- "Thursday": "11:00 AM - 10:00 PM",
- "Friday": "11:00 AM - 11:00 PM",
- "Saturday": "11:00 AM - 11:00 PM",
- "Sunday": "12:00 PM - 9:30 PM"
- }
+ "searchString": "Restaurant in Staten Island",
+ "rank": 3,
+ "title": "Kim's Island",
+ "placeId": "ChIJJaKM4pyKwokRCZ8XaBNj_Gw",
+ "categoryName": "Chinese restaurant",
+ "price": "$10–20",
+ "rating": 4.6,
+ "reviewsCount": 182,
+ "featuredInLists": ["Best Chinese Food", "Top Rated Restaurants"],
+
+ // Complete address information for targeted outreach
+ "address": "175 Main St, Staten Island, NY 10307",
+ "neighborhood": "Tottenville",
+ "street": "175 Main St",
+ "city": "Staten Island",
+ "postalCode": "10307",
+ "state": "New York",
+ "countryCode": "US",
+ "plusCode": "GQ62+8M Staten Island, New York",
+
+ // Multiple contact channels
+ "website": "http://kimsislandsi.com/",
+ "phone": "(718) 356-5168",
+ "phoneUnformatted": "+17183565168",
+ "email": "info@kimsislandsi.com", // From website enrichment
+
+ // Business qualification data
+ "yearsInBusiness": 12,
+ "claimThisBusiness": false, // Verified listing
+ "popular": true,
+ "temporarilyClosed": false,
+
+ // Precise location for territory planning
+ "location": {
+ "lat": 40.5107736,
+ "lng": -74.2482624
+ },
+
+ // Operational insights for scheduling outreach
+ "openingHours": {
+ "Monday": "11:00 AM - 10:00 PM",
+ "Tuesday": "11:00 AM - 10:00 PM",
+ "Wednesday": "11:00 AM - 10:00 PM",
+ "Thursday": "11:00 AM - 10:00 PM",
+ "Friday": "11:00 AM - 11:00 PM",
+ "Saturday": "11:00 AM - 11:00 PM",
+ "Sunday": "12:00 PM - 9:30 PM"
+ }
}
```
@@ -178,140 +178,127 @@ This module provides the most flexible options for defining where and how to sea
```json title="Advances output data, shortened sample"
{
- "searchString": "coffee shop",
- "rank": 9,
- "searchPageUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en",
- "searchPageLoadedUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en",
- "isAdvertisement": false,
- "title": "Bluestone Lane Chelsea Piers Café",
- "price": "$20–30",
- "categoryName": "Coffee shop",
-
- // Address and location data
- "address": "62 Chelsea Piers Pier 62, New York, NY 10011",
- "neighborhood": "Manhattan",
- "street": "62 Chelsea Piers Pier 62",
- "city": "New York",
- "postalCode": "10011",
- "state": "New York",
- "countryCode": "US",
- "location": {
- "lat": 40.7485378,
- "lng": -74.0087457
- },
- "plusCode": "GQ62+8M Staten Island, New York",
-
- // Contact information
- "website": "https://bluestonelane.com/?y_source=1_MjMwNjk1NDAtNzE1LWxvY2F0aW9uLndlYnNpdGU%3D",
- "phone": "(718) 374-6858",
- "phoneUnformatted": "+17183746858",
-
- // Rating and reviews
- "totalScore": 4.3,
- "reviewsCount": 425,
- "imagesCount": 659,
-
- // Business identifiers
- "claimThisBusiness": false,
- "permanentlyClosed": false,
- "temporarilyClosed": false,
- "placeId": "ChIJDTUgz1dZwokRtsQ97Tbf0cA",
- "categories": ["Coffee shop", "Cafe"],
- "fid": "0x89c25957cf20350d:0xc0d1df36ed3dc4b6",
- "cid": "13894131752416167094",
-
- // Operating hours
- "openingHours": [
- {"day": "Monday", "hours": "7 AM to 6 PM"},
- {"day": "Tuesday", "hours": "7 AM to 6 PM"},
- {"day": "Wednesday", "hours": "7 AM to 6 PM"},
- {"day": "Thursday", "hours": "7 AM to 6 PM"},
- {"day": "Friday", "hours": "7 AM to 6 PM"},
- {"day": "Saturday", "hours": "7 AM to 6 PM"},
- {"day": "Sunday", "hours": "7 AM to 6 PM"}
- ],
-
- // Business attributes and amenities
- "additionalInfo": {
- "Service options": [
- {"Outdoor seating": true},
- {"Curbside pickup": true},
- {"No-contact delivery": true},
- {"Delivery": true},
- {"Onsite services": true},
- {"Takeout": true},
- {"Dine-in": true}
+ "searchString": "coffee shop",
+ "rank": 9,
+ "searchPageUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en",
+ "searchPageLoadedUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en",
+ "isAdvertisement": false,
+ "title": "Bluestone Lane Chelsea Piers Café",
+ "price": "$20–30",
+ "categoryName": "Coffee shop",
+
+ // Address and location data
+ "address": "62 Chelsea Piers Pier 62, New York, NY 10011",
+ "neighborhood": "Manhattan",
+ "street": "62 Chelsea Piers Pier 62",
+ "city": "New York",
+ "postalCode": "10011",
+ "state": "New York",
+ "countryCode": "US",
+ "location": {
+ "lat": 40.7485378,
+ "lng": -74.0087457
+ },
+ "plusCode": "GQ62+8M Staten Island, New York",
+
+ // Contact information
+ "website": "https://bluestonelane.com/?y_source=1_MjMwNjk1NDAtNzE1LWxvY2F0aW9uLndlYnNpdGU%3D",
+ "phone": "(718) 374-6858",
+ "phoneUnformatted": "+17183746858",
+
+ // Rating and reviews
+ "totalScore": 4.3,
+ "reviewsCount": 425,
+ "imagesCount": 659,
+
+ // Business identifiers
+ "claimThisBusiness": false,
+ "permanentlyClosed": false,
+ "temporarilyClosed": false,
+ "placeId": "ChIJDTUgz1dZwokRtsQ97Tbf0cA",
+ "categories": ["Coffee shop", "Cafe"],
+ "fid": "0x89c25957cf20350d:0xc0d1df36ed3dc4b6",
+ "cid": "13894131752416167094",
+
+ // Operating hours
+ "openingHours": [
+ { "day": "Monday", "hours": "7 AM to 6 PM" },
+ { "day": "Tuesday", "hours": "7 AM to 6 PM" },
+ { "day": "Wednesday", "hours": "7 AM to 6 PM" },
+ { "day": "Thursday", "hours": "7 AM to 6 PM" },
+ { "day": "Friday", "hours": "7 AM to 6 PM" },
+ { "day": "Saturday", "hours": "7 AM to 6 PM" },
+ { "day": "Sunday", "hours": "7 AM to 6 PM" }
],
- "Highlights": [
- {"Great coffee": true},
- {"Great tea selection": true},
- {"Live music": true},
- {"Live performances": true},
- {"Rooftop seating": true}
- ],
- "Popular for": [
- {"Breakfast": true},
- {"Lunch": true},
- {"Solo dining": true},
- {"Good for working on laptop": true}
- ],
- "Accessibility": [
- {"Wheelchair accessible entrance": true},
- {"Wheelchair accessible parking lot": true},
- {"Wheelchair accessible restroom": true},
- {"Wheelchair accessible seating": true}
- ],
- "Offerings": [
- {"Coffee": true},
- {"Comfort food": true},
- {"Organic dishes": true},
- {"Prepared foods": true},
- {"Quick bite": true},
- {"Small plates": true},
- {"Vegetarian options": true}
- ],
- "Dining options": [
- {"Breakfast": true},
- {"Brunch": true},
- {"Lunch": true},
- {"Catering": true},
- {"Dessert": true},
- {"Seating": true}
- ],
- "Amenities": [
- {"Restroom": true},
- {"Wi-Fi": true},
- {"Free Wi-Fi": true}
- ],
- "Atmosphere": [
- {"Casual": true},
- {"Cozy": true},
- {"Trendy": true}
- ],
- "Crowd": [
- {"Family-friendly": true},
- {"LGBTQ+ friendly": true},
- {"Transgender safespace": true}
- ],
- "Planning": [
- {"Accepts reservations": true}
- ],
- "Payments": [
- {"Credit cards": true},
- {"Debit cards": true},
- {"NFC mobile payments": true}
- ],
- "Children": [
- {"Good for kids": true},
- {"High chairs": true}
- ]
- },
-
- // Image and metadata
- "imageUrl": "https://lh3.googleusercontent.com/p/AF1QipMl6-SnuqYEeE3mD54M0q5D5nysRUZQj1BB0g8=w408-h272-k-no",
- "kgmid": "/g/11ph8zh6sg",
- "url": "https://www.google.com/maps/search/?api=1&query=Bluestone%20Lane%20Chelsea%20Piers%20Caf%C3%A9&query_place_id=ChIJDTUgz1dZwokRtsQ97Tbf0cA",
- "scrapedAt": "2025-04-22T14:23:34.961Z"
+
+ // Business attributes and amenities
+ "additionalInfo": {
+ "Service options": [
+ { "Outdoor seating": true },
+ { "Curbside pickup": true },
+ { "No-contact delivery": true },
+ { "Delivery": true },
+ { "Onsite services": true },
+ { "Takeout": true },
+ { "Dine-in": true }
+ ],
+ "Highlights": [
+ { "Great coffee": true },
+ { "Great tea selection": true },
+ { "Live music": true },
+ { "Live performances": true },
+ { "Rooftop seating": true }
+ ],
+ "Popular for": [
+ { "Breakfast": true },
+ { "Lunch": true },
+ { "Solo dining": true },
+ { "Good for working on laptop": true }
+ ],
+ "Accessibility": [
+ { "Wheelchair accessible entrance": true },
+ { "Wheelchair accessible parking lot": true },
+ { "Wheelchair accessible restroom": true },
+ { "Wheelchair accessible seating": true }
+ ],
+ "Offerings": [
+ { "Coffee": true },
+ { "Comfort food": true },
+ { "Organic dishes": true },
+ { "Prepared foods": true },
+ { "Quick bite": true },
+ { "Small plates": true },
+ { "Vegetarian options": true }
+ ],
+ "Dining options": [
+ { "Breakfast": true },
+ { "Brunch": true },
+ { "Lunch": true },
+ { "Catering": true },
+ { "Dessert": true },
+ { "Seating": true }
+ ],
+ "Amenities": [{ "Restroom": true }, { "Wi-Fi": true }, { "Free Wi-Fi": true }],
+ "Atmosphere": [{ "Casual": true }, { "Cozy": true }, { "Trendy": true }],
+ "Crowd": [
+ { "Family-friendly": true },
+ { "LGBTQ+ friendly": true },
+ { "Transgender safespace": true }
+ ],
+ "Planning": [{ "Accepts reservations": true }],
+ "Payments": [
+ { "Credit cards": true },
+ { "Debit cards": true },
+ { "NFC mobile payments": true }
+ ],
+ "Children": [{ "Good for kids": true }, { "High chairs": true }]
+ },
+
+ // Image and metadata
+ "imageUrl": "https://lh3.googleusercontent.com/p/AF1QipMl6-SnuqYEeE3mD54M0q5D5nysRUZQj1BB0g8=w408-h272-k-no",
+ "kgmid": "/g/11ph8zh6sg",
+ "url": "https://www.google.com/maps/search/?api=1&query=Bluestone%20Lane%20Chelsea%20Piers%20Caf%C3%A9&query_place_id=ChIJDTUgz1dZwokRtsQ97Tbf0cA",
+ "scrapedAt": "2025-04-22T14:23:34.961Z"
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/make/search.md b/sources/platform/integrations/workflows-and-notifications/make/search.md
index 501021747e..af21490f1e 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/search.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/search.md
@@ -13,7 +13,7 @@ The Google search modules from [Apify](https://apify.com) allows you to crawl Go
To use the module, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token), which you can find in the Apify Console under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows.
-## Connect Apify Scraper for Google Search modules to Make
+## Connect Apify Scraper for Google Search modules to Make
1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account.
@@ -55,46 +55,46 @@ For each Google Search query, you will extract:
```json title="Search results data, shortened sample"
{
- "searchQuery": {
- "term": "javascript",
- "page": 1,
- "type": "SEARCH",
- "countryCode": "us",
- "languageCode": "en",
- "locationUule": null,
- "device": "DESKTOP"
- },
- "url": "https://www.google.com/search?q=javascript&hl=en&gl=us&num=10",
- "hasNextPage": true,
- "resultsCount": 13600000000,
- "organicResults": [
- {
- "title": "JavaScript Tutorial",
- "url": "https://www.w3schools.com/js/",
- "displayedUrl": "https://www.w3schools.com › js",
- "description": "JavaScript is the world's most popular programming language. JavaScript is the programming language of the Web. JavaScript is easy to learn.",
- "position": 1,
- "emphasizedKeywords": ["JavaScript", "JavaScript", "JavaScript", "JavaScript"],
- "siteLinks": []
- }
- ],
- "paidResults": [
- {
- "title": "JavaScript Online Course - Start Learning JavaScript",
- "url": "https://www.example-ad.com/javascript",
- "displayedUrl": "https://www.example-ad.com",
- "description": "Learn JavaScript from scratch with our comprehensive online course. Start your coding journey today!",
- "position": 1,
- "type": "SHOPPING"
- }
- ],
- "peopleAlsoAsk": [
- {
- "question": "What is JavaScript used for?",
- "answer": "JavaScript is used for creating interactive elements on websites, browser games, frontend of web applications, mobile applications, and server applications...",
- "url": "https://www.example.com/javascript-uses"
- }
- ]
+ "searchQuery": {
+ "term": "javascript",
+ "page": 1,
+ "type": "SEARCH",
+ "countryCode": "us",
+ "languageCode": "en",
+ "locationUule": null,
+ "device": "DESKTOP"
+ },
+ "url": "https://www.google.com/search?q=javascript&hl=en&gl=us&num=10",
+ "hasNextPage": true,
+ "resultsCount": 13600000000,
+ "organicResults": [
+ {
+ "title": "JavaScript Tutorial",
+ "url": "https://www.w3schools.com/js/",
+ "displayedUrl": "https://www.w3schools.com › js",
+ "description": "JavaScript is the world's most popular programming language. JavaScript is the programming language of the Web. JavaScript is easy to learn.",
+ "position": 1,
+ "emphasizedKeywords": ["JavaScript", "JavaScript", "JavaScript", "JavaScript"],
+ "siteLinks": []
+ }
+ ],
+ "paidResults": [
+ {
+ "title": "JavaScript Online Course - Start Learning JavaScript",
+ "url": "https://www.example-ad.com/javascript",
+ "displayedUrl": "https://www.example-ad.com",
+ "description": "Learn JavaScript from scratch with our comprehensive online course. Start your coding journey today!",
+ "position": 1,
+ "type": "SHOPPING"
+ }
+ ],
+ "peopleAlsoAsk": [
+ {
+ "question": "What is JavaScript used for?",
+ "answer": "JavaScript is used for creating interactive elements on websites, browser games, frontend of web applications, mobile applications, and server applications...",
+ "url": "https://www.example.com/javascript-uses"
+ }
+ ]
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/make/tiktok.md b/sources/platform/integrations/workflows-and-notifications/make/tiktok.md
index e99f79f7c4..8bf73bbd96 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/tiktok.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/tiktok.md
@@ -45,7 +45,7 @@ Get profile details via [Apify's TikTok Profile Scraper](https://apify.com/clock
For each TikTok profile, you will extract:
-- _Basic profile details_: name, nickname, bio, ID, and profile URL.
+- _Basic profile details_: name, nickname, bio, ID, and profile URL.
- _Account status_: whether the account is verified or not, and if it's a business and seller account.
- _Follower and engagement metrics_: number of followers and accounts followed.
- _Profile avatar_: avatar URLs.
@@ -54,7 +54,7 @@ For each TikTok profile, you will extract:
```json title="Profile data, shortened sample"
[
{
- "authorMeta": {
+ "authorMeta": {
"id": "6987048613642159109",
"name": "nasaofficial",
"profileUrl": "https://www.tiktok.com/@nasaofficial",
@@ -80,14 +80,14 @@ For each TikTok profile, you will extract:
"video": 0,
"digg": 0
},
- "input": "https://www.tiktok.com/@nasaofficial",
+ "input": "https://www.tiktok.com/@nasaofficial"
}
]
```
### Extract TikTok comments
-Retrieve comments from videos by calling [Apify's TikTok Comments Scraper](https://apify.com/clockworks/tiktok-comments-scraper). To set up this module, you will need to add TikTok video URLs to extract the comments from, the desired number of comments, and optionally, the maximum number of replies per comment.
+Retrieve comments from videos by calling [Apify's TikTok Comments Scraper](https://apify.com/clockworks/tiktok-comments-scraper). To set up this module, you will need to add TikTok video URLs to extract the comments from, the desired number of comments, and optionally, the maximum number of replies per comment.
For each TikTok video, you will extract:
@@ -119,8 +119,8 @@ For each TikTok video, you will extract:
"uid": "7095709566285480965",
"cid": "7338091744464978720",
"avatarThumbnail": "https://p16-sign-useast2a.tiktokcdn.com/tos-useast2a-avt-0068-euttp/2c511269b14f70cca0c11c3285ddc668~tplv-tiktokx-cropcenter:100:100.jpg?dr=10399&nonce=11659&refresh_token=c2a577eebaa68fc73aac11e9b99fefcb&x-expires=1739973600&x-signature=LUTudhynytGwrfL9MKFHKO8v7EA%3D&idc=no1a&ps=13740610&shcp=ff37627b&shp=30310797&t=4d5b0474"
- },
- ]
+ }
+]
```
### Extract TikTok hashtags
diff --git a/sources/platform/integrations/workflows-and-notifications/make/youtube.md b/sources/platform/integrations/workflows-and-notifications/make/youtube.md
index 04a330fb57..4c017a1e38 100644
--- a/sources/platform/integrations/workflows-and-notifications/make/youtube.md
+++ b/sources/platform/integrations/workflows-and-notifications/make/youtube.md
@@ -52,122 +52,122 @@ For YouTube URLs, you can extract:
```json title="Channel data sample"
{
- "id": "HV6OlMPn5sI",
- "title": "Raimu - The Spirit Within 🍃 [lofi hip hop/relaxing beats]",
- "duration": "29:54",
- "channelName": "Lofi Girl",
- "channelUrl": "https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow",
- "date": "10 months ago",
- "url": "https://www.youtube.com/watch?v=HV6OlMPn5sI",
- "viewCount": 410458,
- "fromYTUrl": "https://www.youtube.com/@LofiGirl/videos",
- "channelDescription": "\"That girl studying by the window non-stop\"\n\n🎧 | Listen on Spotify, Apple music and more\n→ https://bit.ly/lofigirl-playlists\n\n💬 | Join the Lofi Girl community \n→ https://bit.ly/lofigirl-discord\n→ https://bit.ly/lofigirl-reddit\n\n🌎 | Lofi Girl on all social media\n→ https://bit.ly/lofigirl-sociaI",
- "channelDescriptionLinks": [
- {
- "text": "Discord",
- "url": "https://discord.com/invite/hUKvJnw"
- },
- {
- "text": "Tiktok",
- "url": "https://www.tiktok.com/@lofigirl/"
- },
- {
- "text": "Instagram",
- "url": "https://www.instagram.com/lofigirl/"
- },
- {
- "text": "Twitter",
- "url": "https://twitter.com/lofigirl"
- },
- {
- "text": "Spotify",
- "url": "https://open.spotify.com/playlist/0vvXsWCC9xrXsKd4FyS8kM"
- },
- {
- "text": "Apple music",
- "url": "https://music.apple.com/fr/playlist/lofi-hip-hop-music-beats-to-relax-study-to/pl.u-2aoq8mqiGo7J6A0"
- },
- {
- "text": "Merch",
- "url": "https://lofigirlshop.com/"
- }
- ],
- "channelJoinedDate": "Mar 18, 2015",
- "channelLocation": "France",
- "channelTotalVideos": 409,
- "channelTotalViews": "1,710,167,563",
- "numberOfSubscribers": 13100000,
- "isMonetized": true,
- "inputChannelUrl": "https://www.youtube.com/@LofiGirl/about"
+ "id": "HV6OlMPn5sI",
+ "title": "Raimu - The Spirit Within 🍃 [lofi hip hop/relaxing beats]",
+ "duration": "29:54",
+ "channelName": "Lofi Girl",
+ "channelUrl": "https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow",
+ "date": "10 months ago",
+ "url": "https://www.youtube.com/watch?v=HV6OlMPn5sI",
+ "viewCount": 410458,
+ "fromYTUrl": "https://www.youtube.com/@LofiGirl/videos",
+ "channelDescription": "\"That girl studying by the window non-stop\"\n\n🎧 | Listen on Spotify, Apple music and more\n→ https://bit.ly/lofigirl-playlists\n\n💬 | Join the Lofi Girl community \n→ https://bit.ly/lofigirl-discord\n→ https://bit.ly/lofigirl-reddit\n\n🌎 | Lofi Girl on all social media\n→ https://bit.ly/lofigirl-sociaI",
+ "channelDescriptionLinks": [
+ {
+ "text": "Discord",
+ "url": "https://discord.com/invite/hUKvJnw"
+ },
+ {
+ "text": "Tiktok",
+ "url": "https://www.tiktok.com/@lofigirl/"
+ },
+ {
+ "text": "Instagram",
+ "url": "https://www.instagram.com/lofigirl/"
+ },
+ {
+ "text": "Twitter",
+ "url": "https://twitter.com/lofigirl"
+ },
+ {
+ "text": "Spotify",
+ "url": "https://open.spotify.com/playlist/0vvXsWCC9xrXsKd4FyS8kM"
+ },
+ {
+ "text": "Apple music",
+ "url": "https://music.apple.com/fr/playlist/lofi-hip-hop-music-beats-to-relax-study-to/pl.u-2aoq8mqiGo7J6A0"
+ },
+ {
+ "text": "Merch",
+ "url": "https://lofigirlshop.com/"
+ }
+ ],
+ "channelJoinedDate": "Mar 18, 2015",
+ "channelLocation": "France",
+ "channelTotalVideos": 409,
+ "channelTotalViews": "1,710,167,563",
+ "numberOfSubscribers": 13100000,
+ "isMonetized": true,
+ "inputChannelUrl": "https://www.youtube.com/@LofiGirl/about"
}
```
```json title="Video data sample"
{
- "title": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon)",
- "id": "CW7gfrTlr0Y",
- "url": "https://www.youtube.com/watch?v=CW7gfrTlr0Y",
- "thumbnailUrl": "https://i.ytimg.com/vi/CW7gfrTlr0Y/maxresdefault.jpg",
- "viewCount": 35582192,
- "date": "2021-12-21",
- "likes": 512238,
- "location": null,
- "channelName": "StromaeVEVO",
- "channelUrl": "http://www.youtube.com/@StromaeVEVO",
- "numberOfSubscribers": 6930000,
- "duration": "00:03:17",
- "commentsCount": 14,
- "text": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon on NBC)\nListen to \"La solassitude\" here: https://stromae.lnk.to/la-solassitude\nOrder my new album \"Multitude\" here: https://stromae.lnk.to/multitudeID\n--\nhttps://www.stromae.com/fr/\nhttps://www.tiktok.com/@stromae\nhttps://www.facebook.com/stromae\nhttps://www.instagram.com/stromae\nhttps://twitter.com/stromae\n / @stromae \n--\nMosaert\nPaul Van Haver (Stromae) : creative direction\nCoralie Barbier : creative direction and fashion design\nLuc Van Haver : creative direction\nGaëlle Birenbaum : communication & project manager\nEvence Guinet-Dannonay : executive assistant\nGaëlle Cools : content & community manager\nRoxane Hauzeur : textile product manager\nDiego Mitrugno : office manager\n\nPartizan\nProducer : Auguste Bas\nLine Producer : Zélie Deletrain \nProduction coordinator : Lou Bardou-Jacquet \nProduction assistant : Hugo Dao\nProduction assistant : Adrien Bossa\nProduction assistant : Basile Jan\n\nDirector : Julien Soulier \n1st assistant director : Mathieu Perez \n2nd assistant director : Leila Gentet \n\nDirector of Photography : Kaname Onoyama \n1st assistant operator : Micaela albanese\n2nd assistant operator : Florian Rey \nDoP Mantee : Zhaopeng Zhong\nMaking of : Adryen Barreyat\n\nHead Gaffer : Sophie Delorme \nElectrician : Sacha Brauman\nElectrician: Tom Devianne\nLighting designer : Aurélien Dayot\nPrelight electrician : Emmanuel Malherbe\n\nHead Grip : Dioclès Desrieux \nBest Boy grip : Eloi Perrin \nPrelight Grip : Vladimir Duranovic \n\nLocation manager : Léo Rodriguez \nLocation manager assistant : Grégoire Décatoire \nLocation manager assistant : Mathieu Barazer \n\nStylist : Sandra Gonzalez \nStylist assistant : Sarah Bernard\n\nMake Up and Hair Artist : Camille Roche \nMake up Artist : Carla Lange \nMake Up and Hair Artist : Victoria Pinto \n\nSound Engineer : Lionel Capouillez \nBackliner : Nicolas Fradet \n\nProduction Designer : Penelope Hemon \n\nChoreographer : Marion Motin \nChoreographer assistant : Jeanne Michel \n\nPost production : Royal Post\nPost-Production Director : Cindy Durand Paucsik\nEditor : Marco Novoa\nEditor assistant : Térence Nury \nGrader : Vincent Amor\nVFX Supervisor : Julien Laudicina\nGraphic designer : Quentin Mesureux \nGraphic designer : Lucas Ponçon \nFilm Lab Assistant : Hadrian Kalmbach\n\nMusicians:\nFlorian Rossi \nManoli Avgoustinatos\nSimon Schoovaerts \nYoshi Masuda \n\nDancers: \nJuliana Casas\nLydie Alberto \nRobinson Cassarino\nYohann Hebi daher\nChris Fargeot \nAudrey Hurtis \nElodie Hilsum\nDaya jones \nThéophile Bensusan \nBrandon Masele \nJean Michel Premier \nKevin Bago\nAchraf Bouzefour\nPauline Journe \nCaroline Bouquet \nManon Bouquet\nAshley Biscette \nJocelyn Laurent \nOumrata Konan\nKylian Toto\nEnzo Lesne \nSalomon Mpondo-Dicka\nSandrine Monar \nKarl-Ruben Noel\n\n#Stromae #Sante #JimmyFallon",
- "descriptionLinks": [
- {
- "url": "https://stromae.lnk.to/la-solassitude",
- "text": "https://stromae.lnk.to/la-solassitude"
- },
- {
- "url": "https://stromae.lnk.to/multitudeID",
- "text": "https://stromae.lnk.to/multitudeID"
- },
- {
- "url": "https://www.stromae.com/fr/",
- "text": "https://www.stromae.com/fr/"
- },
- {
- "url": "https://www.tiktok.com/@stromae",
- "text": "https://www.tiktok.com/@stromae"
- },
- {
- "url": "https://www.facebook.com/stromae",
- "text": "https://www.facebook.com/stromae"
- },
- {
- "url": "https://www.instagram.com/stromae",
- "text": "https://www.instagram.com/stromae"
- },
- {
- "url": "https://twitter.com/stromae",
- "text": "https://twitter.com/stromae"
- },
- {
- "url": "https://www.youtube.com/channel/UCXF0YCBWewAj3RytJUAivGA",
- "text": " / @stromae "
- },
- {
- "url": "https://www.youtube.com/hashtag/stromae",
- "text": "#Stromae"
- },
- {
- "url": "https://www.youtube.com/hashtag/sante",
- "text": "#Sante"
- },
- {
- "url": "https://www.youtube.com/hashtag/jimmyfallon",
- "text": "#JimmyFallon"
- }
- ],
- "subtitles": null,
- "comments": null,
- "isMonetized": true,
- "commentsTurnedOff": false
+ "title": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon)",
+ "id": "CW7gfrTlr0Y",
+ "url": "https://www.youtube.com/watch?v=CW7gfrTlr0Y",
+ "thumbnailUrl": "https://i.ytimg.com/vi/CW7gfrTlr0Y/maxresdefault.jpg",
+ "viewCount": 35582192,
+ "date": "2021-12-21",
+ "likes": 512238,
+ "location": null,
+ "channelName": "StromaeVEVO",
+ "channelUrl": "http://www.youtube.com/@StromaeVEVO",
+ "numberOfSubscribers": 6930000,
+ "duration": "00:03:17",
+ "commentsCount": 14,
+ "text": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon on NBC)\nListen to \"La solassitude\" here: https://stromae.lnk.to/la-solassitude\nOrder my new album \"Multitude\" here: https://stromae.lnk.to/multitudeID\n--\nhttps://www.stromae.com/fr/\nhttps://www.tiktok.com/@stromae\nhttps://www.facebook.com/stromae\nhttps://www.instagram.com/stromae\nhttps://twitter.com/stromae\n / @stromae \n--\nMosaert\nPaul Van Haver (Stromae) : creative direction\nCoralie Barbier : creative direction and fashion design\nLuc Van Haver : creative direction\nGaëlle Birenbaum : communication & project manager\nEvence Guinet-Dannonay : executive assistant\nGaëlle Cools : content & community manager\nRoxane Hauzeur : textile product manager\nDiego Mitrugno : office manager\n\nPartizan\nProducer : Auguste Bas\nLine Producer : Zélie Deletrain \nProduction coordinator : Lou Bardou-Jacquet \nProduction assistant : Hugo Dao\nProduction assistant : Adrien Bossa\nProduction assistant : Basile Jan\n\nDirector : Julien Soulier \n1st assistant director : Mathieu Perez \n2nd assistant director : Leila Gentet \n\nDirector of Photography : Kaname Onoyama \n1st assistant operator : Micaela albanese\n2nd assistant operator : Florian Rey \nDoP Mantee : Zhaopeng Zhong\nMaking of : Adryen Barreyat\n\nHead Gaffer : Sophie Delorme \nElectrician : Sacha Brauman\nElectrician: Tom Devianne\nLighting designer : Aurélien Dayot\nPrelight electrician : Emmanuel Malherbe\n\nHead Grip : Dioclès Desrieux \nBest Boy grip : Eloi Perrin \nPrelight Grip : Vladimir Duranovic \n\nLocation manager : Léo Rodriguez \nLocation manager assistant : Grégoire Décatoire \nLocation manager assistant : Mathieu Barazer \n\nStylist : Sandra Gonzalez \nStylist assistant : Sarah Bernard\n\nMake Up and Hair Artist : Camille Roche \nMake up Artist : Carla Lange \nMake Up and Hair Artist : Victoria Pinto \n\nSound Engineer : Lionel Capouillez \nBackliner : Nicolas Fradet \n\nProduction Designer : Penelope Hemon \n\nChoreographer : Marion Motin \nChoreographer assistant : Jeanne Michel \n\nPost production : Royal Post\nPost-Production Director : Cindy Durand Paucsik\nEditor : Marco Novoa\nEditor assistant : Térence Nury \nGrader : Vincent Amor\nVFX Supervisor : Julien Laudicina\nGraphic designer : Quentin Mesureux \nGraphic designer : Lucas Ponçon \nFilm Lab Assistant : Hadrian Kalmbach\n\nMusicians:\nFlorian Rossi \nManoli Avgoustinatos\nSimon Schoovaerts \nYoshi Masuda \n\nDancers: \nJuliana Casas\nLydie Alberto \nRobinson Cassarino\nYohann Hebi daher\nChris Fargeot \nAudrey Hurtis \nElodie Hilsum\nDaya jones \nThéophile Bensusan \nBrandon Masele \nJean Michel Premier \nKevin Bago\nAchraf Bouzefour\nPauline Journe \nCaroline Bouquet \nManon Bouquet\nAshley Biscette \nJocelyn Laurent \nOumrata Konan\nKylian Toto\nEnzo Lesne \nSalomon Mpondo-Dicka\nSandrine Monar \nKarl-Ruben Noel\n\n#Stromae #Sante #JimmyFallon",
+ "descriptionLinks": [
+ {
+ "url": "https://stromae.lnk.to/la-solassitude",
+ "text": "https://stromae.lnk.to/la-solassitude"
+ },
+ {
+ "url": "https://stromae.lnk.to/multitudeID",
+ "text": "https://stromae.lnk.to/multitudeID"
+ },
+ {
+ "url": "https://www.stromae.com/fr/",
+ "text": "https://www.stromae.com/fr/"
+ },
+ {
+ "url": "https://www.tiktok.com/@stromae",
+ "text": "https://www.tiktok.com/@stromae"
+ },
+ {
+ "url": "https://www.facebook.com/stromae",
+ "text": "https://www.facebook.com/stromae"
+ },
+ {
+ "url": "https://www.instagram.com/stromae",
+ "text": "https://www.instagram.com/stromae"
+ },
+ {
+ "url": "https://twitter.com/stromae",
+ "text": "https://twitter.com/stromae"
+ },
+ {
+ "url": "https://www.youtube.com/channel/UCXF0YCBWewAj3RytJUAivGA",
+ "text": " / @stromae "
+ },
+ {
+ "url": "https://www.youtube.com/hashtag/stromae",
+ "text": "#Stromae"
+ },
+ {
+ "url": "https://www.youtube.com/hashtag/sante",
+ "text": "#Sante"
+ },
+ {
+ "url": "https://www.youtube.com/hashtag/jimmyfallon",
+ "text": "#JimmyFallon"
+ }
+ ],
+ "subtitles": null,
+ "comments": null,
+ "isMonetized": true,
+ "commentsTurnedOff": false
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/n8n/index.md b/sources/platform/integrations/workflows-and-notifications/n8n/index.md
index a32fdbec4b..95d60250df 100644
--- a/sources/platform/integrations/workflows-and-notifications/n8n/index.md
+++ b/sources/platform/integrations/workflows-and-notifications/n8n/index.md
@@ -28,7 +28,7 @@ If you're running a self-hosted n8n instance, you can install the Apify communit
1. Open your n8n instance.
1. Go to **Settings > Community Nodes**.
1. Select **Install**.
-1. Enter the npm package name: `@apify/n8n-nodes-apify` (for latest version). To install a specific [version](https://www.npmjs.com/package/@apify/n8n-nodes-apify?activeTab=versions) enter e.g `@apify/n8n-nodes-apify@0.4.4`.
+1. Enter the npm package name: `@apify/n8n-nodes-apify` (for the latest version). To install a specific [version](https://www.npmjs.com/package/@apify/n8n-nodes-apify?activeTab=versions), enter, for example, `@apify/n8n-nodes-apify@0.4.4`.
1. Agree to the [risks](https://docs.n8n.io/integrations/community-nodes/risks/) of using community nodes and select **Install**.
1. You can now use the node in your workflows.
@@ -63,7 +63,7 @@ The Apify node offers two authentication methods to securely connect to your Api
1. Enter your Apify API token (you can find it in the [Apify Console](https://console.apify.com/settings/integrations)).
1. Click **Save**.
- 
+
### OAuth2 (cloud instance only)
@@ -72,7 +72,7 @@ The Apify node offers two authentication methods to securely connect to your Api
1. Select **Connect my account** and authorize with your Apify account.
1. n8n automatically retrieves and stores the OAuth2 tokens.
- 
+
:::note Credential Control
@@ -122,37 +122,36 @@ Actions allow you to perform operations like running an Actor within a workflow.
- **Memory**: Amount of memory allocated for the Actor run, in megabytes
- **Build Tag**: Specifies the Actor build tag to run. By default, the run uses the build specified in the default run configuration for the Actor (typically `latest`)
- **Wait for finish**: Whether to wait for the run to finish before continuing. If true, the node will wait for the run to complete (successfully or not) before moving to the next node
- 
+ 
1. Add another Apify operation called **Get Dataset Items**.
- Set **Dataset ID** parameter as **defaultDatasetId** value received from the previous **Run Actor** node. This will give you the output of the Actor run
- 
+ 
1. Add any subsequent nodes (e.g. Google Sheets) to process or store the output
1. Save and execute the workflow
- 
+ 
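
As a rough illustration of what the **Run Actor** and **Get Dataset Items** operations described in the steps above do under the hood, the sketch below uses the `apify-client` Python package. The Actor ID and input are placeholders; the `memory_mbytes` and `build` arguments loosely correspond to the **Memory** and **Build Tag** node parameters, and `call()` blocks until the run finishes, mirroring **Wait for finish**.

```python
# Rough Python equivalent of the "Run Actor" + "Get Dataset Items" flow.
# Requires: pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# "Run Actor" with Wait for finish enabled: call() returns once the run ends.
run = client.actor("<ACTOR_ID>").call(
    run_input={},        # your Actor input (placeholder)
    memory_mbytes=1024,  # roughly the "Memory" parameter
    build="latest",      # roughly the "Build Tag" parameter
)

# "Get Dataset Items": read the run's default dataset by its ID.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Fetched {len(items)} items")
```

In n8n itself you reference `defaultDatasetId` from the **Run Actor** node's output when configuring **Get Dataset Items**, as described above; no code is required.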
## Use Apify Node as an AI tool
You can run Apify operations, retrieve the results, and use AI to process, analyze, and summarize the data, or generate insights and recommendations.
- 
-
+
1. Create a new workflow.
-1. **Add a trigger**: Search for and select **Chat Trigger**.
-1. **Add the AI Agent node**: Click **Add Node**, search for **AI Agent**, and select it.
+1. **Add a trigger**: Search for and select **Chat Trigger**.
+1. **Add the AI Agent node**: Click **Add Node**, search for **AI Agent**, and select it.
1. Configure the AI Agent:
- **Chat Model**: Choose the language model you want to use.
- **Memory (optional)**: Enables the AI model to remember and reference past interactions.
- - **Tools**: Search for **Apify**, select **Apify Tool**, and click **Add to Workflow**. Choose any available operation and configure it.
+ - **Tools**: Search for **Apify**, select **Apify Tool**, and click **Add to Workflow**. Choose any available operation and configure it.
1. **Run the workflow**: Save it, then provide a prompt instructing the Agent to use the Apify tool with the operations you configured earlier.
:::note
- Let the AI model define the parameters in your node when possible. Click the _sparkle_ icon next to a parameter to have the AI fill it in for you.
+Let the AI model define the parameters in your node when possible. Click the _sparkle_ icon next to a parameter to have the AI fill it in for you.
:::
- 
+
## Available Operations
@@ -190,7 +189,7 @@ Pull data from Apify storage.
#### Key-Value Stores
-- **Get Record**: Retrieves a value from a [key-value store](/platform/storage/key-value-store)
+- **Get Record**: Retrieves a value from a [key-value store](/platform/storage/key-value-store)
### Triggers
diff --git a/sources/platform/integrations/workflows-and-notifications/n8n/website-content-crawler.md b/sources/platform/integrations/workflows-and-notifications/n8n/website-content-crawler.md
index 4b79fc7fc9..726402dce5 100644
--- a/sources/platform/integrations/workflows-and-notifications/n8n/website-content-crawler.md
+++ b/sources/platform/integrations/workflows-and-notifications/n8n/website-content-crawler.md
@@ -45,7 +45,7 @@ On n8n Cloud, instance owners can toggle visibility of verified community nodes
1. Select **Connect my account** and authorize with your Apify account.
1. n8n automatically retrieves and stores the OAuth2 tokens.
- 
+
:::note Cloud API Key management
@@ -56,7 +56,6 @@ See the [**Connect** section for n8n self-hosted](#connect-self-hosted) for deta
With authentication set up, you can now create workflows that incorporate the Apify node.
-
## n8n self-hosted setup
This section explains how to install and connect the Apify node when running your own n8n instance.
@@ -96,7 +95,6 @@ If you're running a self-hosted n8n instance, you can install the Apify communit

-
## Website Content Crawler by Apify module
This module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction.
@@ -132,22 +130,22 @@ For each crawled web page, you'll receive:
```json title="Sample output (shortened)"
{
- "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
- "crawl": {
- "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
- "loadedTime": "2025-04-22T14:33:20.514Z",
- "referrerUrl": "https://docs.apify.com/academy",
- "depth": 1,
- "httpStatusCode": 200
- },
- "metadata": {
- "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
- "title": "Web scraping for beginners | Apify Documentation",
- "description": "Learn the basics of web scraping with a step-by-step tutorial and practical exercises.",
- "languageCode": "en",
- "markdown": "# Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
- "text": "Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
- }
+ "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
+ "crawl": {
+ "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
+ "loadedTime": "2025-04-22T14:33:20.514Z",
+ "referrerUrl": "https://docs.apify.com/academy",
+ "depth": 1,
+ "httpStatusCode": 200
+ },
+ "metadata": {
+ "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
+ "title": "Web scraping for beginners | Apify Documentation",
+ "description": "Learn the basics of web scraping with a step-by-step tutorial and practical exercises.",
+ "languageCode": "en",
+ "markdown": "# Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...",
+ "text": "Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..."
+ }
}
```
diff --git a/sources/platform/integrations/workflows-and-notifications/slack.md b/sources/platform/integrations/workflows-and-notifications/slack.md
index ed46810a76..e4f77bea2c 100644
--- a/sources/platform/integrations/workflows-and-notifications/slack.md
+++ b/sources/platform/integrations/workflows-and-notifications/slack.md
@@ -14,7 +14,6 @@ A tutorial can be found on [Apify Help](https://help.apify.com/en/articles/64540
> Explore the [integration for Slack tutorial](https://help.apify.com/en/articles/6454058-apify-integration-for-slack).
-
[Slack](https://slack.com/) allows you to install various services in your workspace in order to automate and centralize jobs. Apify is one of these services, and it allows you to run your Apify Actors, get notified about their run statuses, and receive your results, all without opening your browser.
## Get started
diff --git a/sources/platform/integrations/workflows-and-notifications/telegram.md b/sources/platform/integrations/workflows-and-notifications/telegram.md
index c69ee337bd..9d47e8d753 100644
--- a/sources/platform/integrations/workflows-and-notifications/telegram.md
+++ b/sources/platform/integrations/workflows-and-notifications/telegram.md
@@ -101,7 +101,6 @@ The best way to do it's to:

-
### Step 2: Create action for your new Telegram bot
Once you've set up your new bot within Zapier, it's time to set up an action.
diff --git a/sources/platform/integrations/workflows-and-notifications/workato.md b/sources/platform/integrations/workflows-and-notifications/workato.md
index 76d52fe334..fda49db53c 100644
--- a/sources/platform/integrations/workflows-and-notifications/workato.md
+++ b/sources/platform/integrations/workflows-and-notifications/workato.md
@@ -28,7 +28,6 @@ The Apify Workato Connector is available in the Workato Community library. Here'
1. Search for **Apify**.
1. Click on the connector and then click **Install**.
-
After installation, the Apify connector appears in **Connector SDK** under the **Tools** tab. After you release the connector, you can use it in your projects.
## Connect your Apify account
@@ -95,11 +94,11 @@ _The Apify connector provides dynamic dropdown lists (pick lists) and flexible i
- **Selection method (pick list vs. manual ID):** Choose from fetched lists or switch to manual and paste an ID. If an item doesn't appear, make sure it exists in your account and has been used at least once, or paste its ID manually.
- Available pick lists:
- - **Actors**: Lists your recently used Actors or Apify Store Actors, displaying the title and username/name
- - **Tasks**: Lists your saved tasks, displaying the task title and Actor name
- - **Datasets**: Lists available datasets, sorted by most recent first
- - **Key-value stores**: Lists available stores, sorted by most recent first
- - **Store Keys**: Dynamically shows keys available in the selected store
+ - **Actors**: Lists your recently used Actors or Apify Store Actors, displaying the title and username/name
+ - **Tasks**: Lists your saved tasks, displaying the task title and Actor name
+ - **Datasets**: Lists available datasets, sorted by most recent first
+ - **Key-value stores**: Lists available stores, sorted by most recent first
+ - **Store Keys**: Dynamically shows keys available in the selected store
### Input types
@@ -120,19 +119,18 @@ Open the Actor or Task Input page in Apify Console, switch format to JSON, and c
When using manual input instead of pick lists, you'll need to provide the correct resource IDs. Here's how to find them in Apify Console:
- **Actor ID**: [Actor detail page](https://console.apify.com/actors) > API panel or URL.
- - Example URL: `https://console.apify.com/actors/`
- - Actor name format: owner~name (for example, `apify~website-scraper`)
+ - Example URL: `https://console.apify.com/actors/`
+ - Actor name format: owner~name (for example, `apify~website-scraper`)
- **Task ID**: [Task detail page](https://console.apify.com/actors/tasks) > API panel or URL.
- - Example URL: `https://console.apify.com/actors/tasks/`
+ - Example URL: `https://console.apify.com/actors/tasks/`
- **Dataset ID**: [Storage > Datasets](https://console.apify.com/storage/datasets) > Dataset detail > API panel or URL.
- - Example URL: `https://console.apify.com/storage/datasets/`
- - Also available in the table on the `Storage > Datasets` page
+ - Example URL: `https://console.apify.com/storage/datasets/`
+ - Also available in the table on the `Storage > Datasets` page
- **Key-value store ID**: [Storage > Key-value stores](https://console.apify.com/storage/Key-value-stores) > Store detail > API panel or URL.
- - Example URL: `https://console.apify.com/storage/Key-value-stores/`
- - Also available in the table on the `Storage > Key-value stores` page
+ - Example URL: `https://console.apify.com/storage/Key-value-stores/`
+ - Also available in the table on the `Storage > Key-value stores` page
- **Webhook ID**: [Actors](https://console.apify.com/actors) > Actor > Integrations.
- - Example URL: `https://console.apify.com/actors//integrations/`
-
+ - Example URL: `https://console.apify.com/actors//integrations/`
## Triggers
@@ -156,7 +154,7 @@ This trigger monitors a specific Apify Actor and starts the recipe when any run

-### Task Run Finished
+### Task Run Finished
_Triggers when an Apify Task run finishes (succeeds, fails, times out, or gets aborted)._
@@ -262,16 +260,16 @@ Provide a single URL and a desired crawler type to get structured scraped data f
Long-running scrapes can exceed typical step execution expectations. Use this asynchronous pattern to keep recipes reliable and scalable.
1. Start the run without waiting
- - In a recipe, add the **Run Actor** action and configure inputs as needed.
- - Run asynchronously (do not block downstream steps on completion).
- - 
+ - In a recipe, add the **Run Actor** action and configure inputs as needed.
+ - Run asynchronously (do not block downstream steps on completion).
+ - 
1. Continue when the run finishes
- - Build a separate recipe with the **Actor Run Finished** trigger.
- - Filter for the specific Actor or Task you started in Step 1.
- - 
+ - Build a separate recipe with the **Actor Run Finished** trigger.
+ - Filter for the specific Actor or Task you started in Step 1.
+ - 
1. Fetch results and process
- - In the triggered recipe, add **Get Dataset Items** (use the dataset ID from the trigger payload) and continue processing.
- - 
+ - In the triggered recipe, add **Get Dataset Items** (use the dataset ID from the trigger payload) and continue processing.
+ - 
## Example use cases
@@ -280,7 +278,7 @@ Long-running scrapes can exceed typical step execution expectations. Use this as
Workato's visual interface makes it easy to connect Apify data with other business applications:
- _Data pills:_ Use output fields from Apify triggers and actions as inputs for subsequent steps
-- _Field mapping:_ Visually map scraped data fields to CRM, database, or spreadsheet columns
+- _Field mapping:_ Visually map scraped data fields to CRM, database, or spreadsheet columns
- _Conditional logic:_ Build workflows that respond differently based on Actor run status or data content
- _Data transformation:_ Apply filters, formatting, and calculations to scraped data before sending to target systems
diff --git a/sources/platform/limits.md b/sources/platform/limits.md
index 8fe52e9a7b..897f0b5bf1 100644
--- a/sources/platform/limits.md
+++ b/sources/platform/limits.md
@@ -137,4 +137,4 @@ The Apify platform also introduces usage limits based on the billing plan to pro
View these limits and adjust your maximum usage limit in [Apify Console](https://console.apify.com/billing#/limits):
-
+
diff --git a/sources/platform/monitoring/index.md b/sources/platform/monitoring/index.md
index 4cb75ef4f0..e7d86c030d 100644
--- a/sources/platform/monitoring/index.md
+++ b/sources/platform/monitoring/index.md
@@ -29,13 +29,13 @@ The monitoring system is free for all users. You can use it to monitor as many A
Currently, the monitoring option offers the following features:
1. Chart showing **statuses** of runs of the Actor or saved task over last 30 days.
- 
+ 
2. Chart displaying **metrics** of the last 200 runs of the Actor or saved task.
- 
+ 
3. Option to set up **alerts** with notifications based on the run metrics.
- 
+ 
> Both charts can also be added to your Apify Console home page so you can quickly see if there are any issues every time you open Apify Console.
@@ -82,7 +82,7 @@ The email and Slack alert notifications both contain the same information. You w
While the in-app notification will contain less information, it will point you directly to the Actor or task that triggered the alert:
-
+
## Other
diff --git a/sources/platform/proxy/datacenter_proxy.md b/sources/platform/proxy/datacenter_proxy.md
index 1cf81cb789..499480107d 100644
--- a/sources/platform/proxy/datacenter_proxy.md
+++ b/sources/platform/proxy/datacenter_proxy.md
@@ -20,14 +20,14 @@ You can refer to our [blog post](https://blog.apify.com/datacenter-proxies-when-
## Features
-* Periodic health checks of proxies in the pool so requests are not forwarded via dead proxies.
-* Intelligent rotation of IP addresses so target hosts are accessed via proxies that have accessed them the longest time ago, to reduce the chance of blocking.
-* Periodically checks whether proxies are banned by selected target websites. If they are, stops forwarding traffic to them to get the proxies unbanned as soon as possible.
-* Ensures proxies are located in specific countries using IP geolocation.
-* Allows selection of groups of proxy servers with specific characteristics.
-* Supports persistent sessions that enable you to keep the same IP address for certain parts of your crawls.
-* Measures statistics of traffic for specific users and hostnames.
-* Allows selection of proxy servers by country.
+- Periodic health checks of proxies in the pool so requests are not forwarded via dead proxies.
+- Intelligent rotation of IP addresses so target hosts are accessed via proxies that have accessed them the longest time ago, to reduce the chance of blocking.
+- Periodically checks whether proxies are banned by selected target websites. If they are, stops forwarding traffic to them to get the proxies unbanned as soon as possible.
+- Ensures proxies are located in specific countries using IP geolocation.
+- Allows selection of groups of proxy servers with specific characteristics.
+- Supports persistent sessions that enable you to keep the same IP address for certain parts of your crawls.
+- Measures statistics of traffic for specific users and hostnames.
+- Allows selection of proxy servers by country.
## Datacenter proxy types
@@ -88,7 +88,6 @@ await Actor.exit();
-
```javascript
@@ -139,7 +138,6 @@ if __name__ == '__main__':
-
```javascript
@@ -185,14 +183,13 @@ This IP/session ID combination is persisted and expires 26 hours later. Each add
If you use the session at least once a day, it will never expire, with two possible exceptions:
-* The proxy server stops responding and is marked as dead during a health check.
-* If the proxy server is part of a proxy group that is refreshed monthly and is rotated out.
+- The proxy server stops responding and is marked as dead during a health check.
+- If the proxy server is part of a proxy group that is refreshed monthly and is rotated out.
If the session is discarded due to the reasons above, it is assigned a new IP address.
To learn more about [sessions](./usage.md#sessions) and [IP address rotation](./usage.md#ip-address-rotation), see the [proxy overview page](./index.md).
-
### Examples using sessions
@@ -214,17 +211,13 @@ const crawler = new PuppeteerCrawler({
},
});
-await crawler.run([
- 'https://proxy.apify.com/?format=json',
- 'https://proxy.apify.com',
-]);
+await crawler.run(['https://proxy.apify.com/?format=json', 'https://proxy.apify.com']);
await Actor.exit();
```
-
```javascript
@@ -280,7 +273,6 @@ if __name__ == '__main__':
-
```javascript
@@ -350,7 +342,6 @@ console.log(data);
-
```python
@@ -377,7 +368,6 @@ print(opener.open("http://proxy.apify.com/?format=json").read())
-
```python
@@ -401,7 +391,6 @@ print(opener.open("http://proxy.apify.com/?format=json").read())
-
```php
@@ -420,7 +409,6 @@ if ($response) echo $response;
-
```php
diff --git a/sources/platform/proxy/google_serp_proxy.md b/sources/platform/proxy/google_serp_proxy.md
index ddc39a0f7b..2440026ae8 100644
--- a/sources/platform/proxy/google_serp_proxy.md
+++ b/sources/platform/proxy/google_serp_proxy.md
@@ -16,9 +16,9 @@ Google SERP proxy allows you to extract search results from Google Search-powere
Our Google SERP proxy currently supports the below services.
-* Google Search (`http://www.google./search`).
-* Google Shopping (`http://www.google./shopping/product/`).
-* Google Shopping Search (`http://www.google./search?tbm=shop`).
+- Google Search (`http://www.google./search`).
+- Google Shopping (`http://www.google./shopping/product/`).
+- Google Shopping Search (`http://www.google./search?tbm=shop`).
> Google SERP proxy can **only** be used for Google Search and Shopping. It cannot be used to access other websites.
@@ -52,10 +52,9 @@ You must use the correct Google domain to get results for your desired country c
For example:
-* Search results from the USA: `http://www.google.com/search?q=`
+- Search results from the USA: `http://www.google.com/search?q=`
-
-* Shopping results from Great Britain: `http://www.google.co.uk/seach?tbm=shop&q=`
+- Shopping results from Great Britain: `http://www.google.co.uk/search?tbm=shop&q=`
See a [full list](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/List_of_Google_domains.html) of available domain names for specific countries. When using them, remember to prepend the domain name with the `www.` prefix.
@@ -186,7 +185,6 @@ console.log(data);
-
```python
@@ -210,7 +208,6 @@ print(opener.open(f"http://www.google.com/search?{query}").read())
-
```python
@@ -238,7 +235,6 @@ print(opener.open(url).read())
-
```php
diff --git a/sources/platform/proxy/index.md b/sources/platform/proxy/index.md
index 1ae1bf2e06..e7114fd375 100644
--- a/sources/platform/proxy/index.md
+++ b/sources/platform/proxy/index.md
@@ -21,7 +21,6 @@ You can use proxies in your [Actors](../actors/index.mdx) or any other applicati
You can view your proxy settings and password on the [Proxy](https://console.apify.com/proxy) page in Apify Console. For pricing information, visit [apify.com/pricing](https://apify.com/pricing).
-
## Quickstart
Usage of Apify Proxy means just a couple of lines of code, thanks to our [SDKs](/sdk):
@@ -97,4 +96,3 @@ Several types of proxy servers exist, each offering distinct advantages, disadva
to="/platform/proxy/google-serp-proxy"
/>
-
diff --git a/sources/platform/proxy/usage.md b/sources/platform/proxy/usage.md
index 0203aeb2f4..1b2e82ddb2 100644
--- a/sources/platform/proxy/usage.md
+++ b/sources/platform/proxy/usage.md
@@ -20,7 +20,9 @@ http://:@:
```
:::caution
+
All usage of Apify Proxy with your password is charged towards your account. Do not share the password with untrusted parties or use it from insecure networks, as **the password is sent unencrypted** due to the HTTP protocol's [limitations](https://www.guru99.com/difference-http-vs-https.html).
+
:::
### External connection
@@ -28,15 +30,17 @@ All usage of Apify Proxy with your password is charged towards your account. Do
If you want to connect to Apify Proxy from outside of the Apify Platform, you need to have a paid Apify plan (to prevent abuse).
If you need to test Apify Proxy before you subscribe, please [contact our support](https://apify.com/contact).
-| Parameter | Value / explanation |
-|---------------------|---------------------|
-| Hostname | `proxy.apify.com`|
-| Port | `8000` |
-| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username.|
-| Password | Apify Proxy password. Your password is displayed on the [Proxy](https://console.apify.com/proxy/groups) page in Apify Console. **Note**: this is not your Apify account password. |
+| Parameter | Value / explanation |
+| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Hostname | `proxy.apify.com` |
+| Port | `8000` |
+| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username. |
+| Password | Apify Proxy password. Your password is displayed on the [Proxy](https://console.apify.com/proxy/groups) page in Apify Console. **Note**: this is not your Apify account password. |
:::caution
+
If you use these connection parameters for connecting to Apify Proxy from your Actors running on the Apify Platform, the connection will still be considered external, it will not work on the Free plan, and on paid plans you will be charged for external data transfer. Please use the connection parameters from the [Connection from Actors](#connection-from-actors) section when using Apify Proxy from Actors.
+
:::
Example connection string for external connections:
@@ -52,12 +56,12 @@ If you want to connect to Apify Proxy from Actors running on the Apify Platform,
If you don't want to use these helpers, and want to connect to Apify Proxy manually, you can find the right configuration values in [environment variables](../actors/development/programming_interface/environment_variables.md) provided to the Actor.
By using this configuration, you ensure that you connect to Apify Proxy directly through the Apify infrastructure, bypassing any external connection via the Internet, thereby improving the connection speed, and ensuring you don't pay for external data transfer.
-| Parameter | Source / explanation |
-|---------------------|---------------------|
-| Hostname | `APIFY_PROXY_HOSTNAME` environment variable |
-| Port | `APIFY_PROXY_PORT` environment variable |
-| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username.|
-| Password | `APIFY_PROXY_PASSWORD` environment variable |
+| Parameter | Source / explanation |
+| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Hostname | `APIFY_PROXY_HOSTNAME` environment variable |
+| Port | `APIFY_PROXY_PORT` environment variable |
+| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username. |
+| Password | `APIFY_PROXY_PASSWORD` environment variable |
Example connection string creation:
@@ -122,15 +126,15 @@ If you want to specify one parameter and not the others, just provide that param
We have code examples for connecting to our proxy using the [Apify SDK](/sdk) and [Crawlee](https://crawlee.dev/) and other libraries, as well as examples in PHP.
-* [Datacenter proxy](./datacenter_proxy.md#examples)
-* [Residential proxy](./residential_proxy.md#connecting-to-residential-proxy)
-* [Google SERP proxy](./google_serp_proxy.md#examples)
+- [Datacenter proxy](./datacenter_proxy.md#examples)
+- [Residential proxy](./residential_proxy.md#connecting-to-residential-proxy)
+- [Google SERP proxy](./google_serp_proxy.md#examples)
For code examples related to proxy management in Apify SDK and Crawlee, see:
-* [Apify SDK JavaScript](/sdk/js/docs/guides/proxy-management)
-* [Apify SDK Python](/sdk/python/docs/concepts/proxy-management)
-* [Crawlee](https://crawlee.dev/docs/guides/proxy-management)
+- [Apify SDK JavaScript](/sdk/js/docs/guides/proxy-management)
+- [Apify SDK Python](/sdk/python/docs/concepts/proxy-management)
+- [Crawlee](https://crawlee.dev/docs/guides/proxy-management)
## IP address rotation {#ip-address-rotation}
@@ -138,8 +142,8 @@ Web scrapers can rotate the IP addresses they use to access websites. They assig
Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or [HTTP requests](https://apify.com/apify/cheerio-scraper) for your scraping jobs, IP address rotation works differently.
-* Browser—a different IP address is used for each browser.
-* HTTP request—a different IP address is used for each request.
+- Browser—a different IP address is used for each browser.
+- HTTP request—a different IP address is used for each request.
Use [sessions](#sessions) to control how you rotate IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.
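
As a rough illustration of how a session pins requests to one IP address, here is a minimal sketch (not part of this page), assuming the Apify SDK's `Actor.createProxyConfiguration()` helper and the `got-scraping` HTTP client; the session name is arbitrary.

```javascript
import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';

await Actor.init();

// Build a proxy URL bound to an arbitrary session name.
// Requests that reuse this URL are routed through the same IP address.
const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl('my_session_1');

const first = await gotScraping({ url: 'https://api.apify.com/v2/browser-info', proxyUrl });
const second = await gotScraping({ url: 'https://api.apify.com/v2/browser-info', proxyUrl });

// Both responses should report the same client IP, because both requests share the session.
console.log(first.body);
console.log(second.body);

await Actor.exit();
```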
@@ -162,8 +166,8 @@ You can see which proxy groups you have access to on the [Proxy page](https://co
If you need to allow communication to `apify.proxy.com`, add the following IP addresses to your firewall rule or whitelist:
-* `18.208.102.16`
-* `35.171.134.41`
+- `18.208.102.16`
+- `35.171.134.41`
## Troubleshooting
@@ -179,24 +183,24 @@ https://api.apify.com/v2/browser-info/
Sometimes when the `502` status code is not comprehensive enough. Therefore, we have modified our server with `590-599` codes instead to provide more insight:
-* `590 Non Successful`: upstream responded with non-200 status code.
-* `591 RESERVED`: *this status code is reserved for further use.*
-* `592 Status Code Out Of Range`: upstream responded with status code different than 100–999.
-* `593 Not Found`: DNS lookup failed, indicating either [`EAI_NODATA`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L82) or [`EAI_NONAME`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L83).
-* `594 Connection Refused`: upstream refused connection.
-* `595 Connection Reset`: connection reset due to loss of connection or timeout.
-* `596 Broken Pipe`: trying to write on a closed socket.
-* `597 Auth Failed`: incorrect upstream credentials.
-* `598 RESERVED`: *this status code is reserved for further use.*
-* `599 Upstream Error`: generic upstream error.
+- `590 Non Successful`: upstream responded with a non-200 status code.
+- `591 RESERVED`: _this status code is reserved for further use._
+- `592 Status Code Out Of Range`: upstream responded with a status code outside the 100–999 range.
+- `593 Not Found`: DNS lookup failed, indicating either [`EAI_NODATA`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L82) or [`EAI_NONAME`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L83).
+- `594 Connection Refused`: upstream refused connection.
+- `595 Connection Reset`: connection reset due to loss of connection or timeout.
+- `596 Broken Pipe`: trying to write on a closed socket.
+- `597 Auth Failed`: incorrect upstream credentials.
+- `598 RESERVED`: _this status code is reserved for further use._
+- `599 Upstream Error`: generic upstream error.
The typical issues behind these codes are:
-* `590` and `592` indicate an issue on the upstream side.
-* `593` indicates an incorrect `proxy-chain` configuration.
-* `594`, `595` and `596` may occur due to connection loss.
-* `597` indicates incorrect upstream credentials.
-* `599` is a generic error, where the above is not applicable.
+- `590` and `592` indicate an issue on the upstream side.
+- `593` indicates an incorrect `proxy-chain` configuration.
+- `594`, `595` and `596` may occur due to connection loss.
+- `597` indicates incorrect upstream credentials.
+- `599` is a generic error, where the above is not applicable.
- Note that the Apify Proxy is based on the [proxy-chain](https://github.com/apify/proxy-chain) open-source `npm` package developed and maintained by Apify.
- You can find the details of the above errors and their implementation there.
+ Note that the Apify Proxy is based on the [proxy-chain](https://github.com/apify/proxy-chain) open-source `npm` package developed and maintained by Apify.
+ You can find the details of the above errors and their implementation there.
diff --git a/sources/platform/quick-start/start_locally.md b/sources/platform/quick-start/start_locally.md
index aaed7069e1..483a82ae0d 100644
--- a/sources/platform/quick-start/start_locally.md
+++ b/sources/platform/quick-start/start_locally.md
@@ -39,9 +39,9 @@ The CLI will ask you to:
3. Select a development template
:::info Explore Actor templates
- Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor.
+ Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor.
- :::
+ :::
The CLI will:
@@ -85,11 +85,13 @@ In the next step, we’ll explore the results in more detail.
### Step 3: Explore the Actor
Let's explore the Actor structure.
+
-#### The `.actor` folder
+#### The `.actor` folder
The `.actor` folder contains the Actor configuration. The `actor.json` file defines the Actor's name, description, and other settings. Find more info in the [actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) definition.
+
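To make this more concrete, here is an illustrative `actor.json` with placeholder values; see the [actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) definition linked above for the authoritative schema.

```json
{
    "actorSpecification": 1,
    "name": "my-first-actor",
    "title": "My First Actor",
    "version": "0.1",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
```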
#### Actor's `input`
diff --git a/sources/platform/quick-start/start_web_ide.md b/sources/platform/quick-start/start_web_ide.md
index 6f22461e52..2885e93ff7 100644
--- a/sources/platform/quick-start/start_web_ide.md
+++ b/sources/platform/quick-start/start_web_ide.md
@@ -100,16 +100,16 @@ Install [apify-cli](https://docs.apify.com/cli/) :
- ```bash
- brew install apify-cli
- ```
+```bash
+brew install apify-cli
+```
- ```bash
- npm -g install apify-cli
- ```
+```bash
+npm -g install apify-cli
+```
@@ -137,7 +137,6 @@ To pull your Actor:
```
As `your-actor-name`, you can use either:
-
- The unique name of the Actor (e.g., `apify/hello-world`)
- The ID of the Actor (e.g., `E2jjCZBezvAZnX8Rb`)
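
For example, pulling the public `apify/hello-world` Actor by its unique name (an Actor ID works the same way) would look roughly like this, assuming the `apify pull` command from the block above:

```bash
apify pull apify/hello-world
```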
diff --git a/sources/platform/schedules.md b/sources/platform/schedules.md
index ef13cc0048..58b68fe549 100644
--- a/sources/platform/schedules.md
+++ b/sources/platform/schedules.md
@@ -13,21 +13,25 @@ slug: /schedules
Schedules allow you to run your Actors and tasks at specific times. You schedule the run frequency using [cron expressions](#cron-expressions).
:::note Timezone & Daylight Savings Time
+
Schedules allow timezone settings and support daylight saving time shifts (DST).
+
:::
You can set up and manage your Schedules using:
-* [Apify Console](https://console.apify.com/schedules)
-* [Apify API](/api/v2/schedules)
-* [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/ScheduleClient)
-* [Python API client](https://docs.apify.com/api/client/python/reference/class/ScheduleClient)
+- [Apify Console](https://console.apify.com/schedules)
+- [Apify API](/api/v2/schedules)
+- [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/ScheduleClient)
+- [Python API client](https://docs.apify.com/api/client/python/reference/class/ScheduleClient)
When scheduling a new Actor or task run, you can override its input settings using a JSON object similarly to when invoking an Actor or task using the [Apify REST API](/api/v2/schedules).
:::note Events Startup Variability
+
In most cases, scheduled events are fired within one second of their scheduled time.
However, runs can be delayed because of a system overload or a server shutting down.
+
:::
Each schedule can be associated with a maximum of _10_ Actors and _10_ Actor tasks.
@@ -39,7 +43,9 @@ Before setting up a new schedule, you should have the [Actor](./actors/index.mdx
To schedule an Actor, you need to have run it at least once before. To run the Actor, navigate to the Actor's page through [Apify Console](https://console.apify.com/store), where you can configure and initiate the Actor's run with your preferred settings by clicking the **Start** button. After this initial run, you can then use Schedules to automate future runs.
:::info Name Length
+
Your schedule's name should be 3–63 characters long.
+
:::
### Apify Console
@@ -57,7 +63,7 @@ Next, you'll need to give the schedule something to run. This is where the Actor
If you're scheduling an Actor run, you'll be able to specify the Actor's [input](./actors/running/input_and_output.md) and running options like [build](./actors/development/builds_and_runs/builds.md), timeout, [memory](./actors/running/usage_and_resources.md).
The **timeout** value is specified in seconds; a value of _0_ means there is no timeout, and the Actor runs until it finishes.
- If you don't provide an input, then the Actor's default input is used. If you provide an input with some fields missing, the missing fields are filled in with values from the default input. If input options are not provided, the default options values are used.
+If you don't provide an input, then the Actor's default input is used. If you provide an input with some fields missing, the missing fields are filled in with values from the default input. If input options are not provided, the default option values are used.

@@ -78,7 +84,9 @@ To create a new [schedule](/api/v2/schedules) using the Apify API, send a `POST`
You can find your [secret API token](./integrations/index.mdx) under the [Integrations](https://console.apify.com/account?tab=integrations) tab of your Apify account settings.
:::caution API authentication recommendations
+
When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL ([more info](/api/v2#authentication)).
+
:::
The `POST` request's payload should be a JSON object specifying the schedule's name, your [user ID](https://console.apify.com/account#/integrations), and the schedule's _actions_.
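
As a sketch of what such a payload could look like, the snippet below creates a schedule with `fetch`; the token, user ID, Actor ID, and cron expression are placeholders, and the field names follow the [Schedules API](/api/v2/schedules) reference, which you should consult for the exact schema.

```js
const response = await fetch('https://api.apify.com/v2/schedules', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        // Token passed in the Authorization header, as recommended above.
        'Authorization': 'Bearer <YOUR_API_TOKEN>',
    },
    body: JSON.stringify({
        name: 'daily-actor-run',
        userId: '<YOUR_USER_ID>',
        isEnabled: true,
        cronExpression: '0 8 * * *',
        timezone: 'UTC',
        actions: [{ type: 'RUN_ACTOR', actorId: '<ACTOR_ID>' }],
    }),
});
console.log(await response.json());
```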
@@ -135,13 +143,13 @@ If you want to manage the notifications for your schedules in bulk, you can do t
A cron expression has the following structure:
| Position | Field | Values | Wildcards | Optional |
-|:---------|:-------------|:-------------------------------|:----------|:---------|
-| 1 | second | 0 - 59 | , - * / | yes |
-| 2 | minute | 0 - 59 | , - * / | no |
-| 3 | hour | 0 - 23 | , - * / | no |
-| 4 | day of month | 1 - 31 | , - * / | no |
-| 5 | month | 1 - 12 | , - * / | no |
-| 6 | day of week | 0 - 7 (0 or 7 is Sunday) | , - * / | no |
+| :------- | :----------- | :----------------------------- | :-------- | :------- |
+| 1 | second | 0 - 59 | , - \* / | yes |
+| 2 | minute | 0 - 59 | , - \* / | no |
+| 3 | hour | 0 - 23 | , - \* / | no |
+| 4 | day of month | 1 - 31 | , - \* / | no |
+| 5 | month | 1 - 12 | , - \* / | no |
+| 6 | day of week | 0 - 7 (0 or 7 is Sunday) | , - \* / | no |
For example, the expression `30 5 16 * * 1` will start an Actor at 16:05:30 every Monday.
@@ -149,15 +157,15 @@ The minimum interval between runs is 10 seconds; if your next run is scheduled s
### Examples of cron expressions
-* `0 8 * * *` - every day at 8 AM.
-* `0 0 * * 0` - every 7 days (at 00:00 on Sunday).
-* `*/3 * * * *` - every 3rd minute.
-* `0 0 1 */2 *` - every other month (at 00:00 on the first day of month, every 2nd month).
+- `0 8 * * *` - every day at 8 AM.
+- `0 0 * * 0` - every 7 days (at 00:00 on Sunday).
+- `*/3 * * * *` - every 3rd minute.
+- `0 0 1 */2 *` - every other month (at 00:00 on the first day of month, every 2nd month).
Additionally, you can use the following shortcut expressions:
-* `@yearly` = `0 0 1 1 *` - once a year, on Jan 1st at midnight.
-* `@monthly` = `0 0 1 * *` - once a month, on the 1st at midnight.
-* `@weekly` = `0 0 * * 0` - once a week, on Sunday at midnight.
-* `@daily` = `0 0 * * *` - run once a day, at midnight.
-* `@hourly` = `0 * * * *` - on the hour, every hour.
+- `@yearly` = `0 0 1 1 *` - once a year, on Jan 1st at midnight.
+- `@monthly` = `0 0 1 * *` - once a month, on the 1st at midnight.
+- `@weekly` = `0 0 * * 0` - once a week, on Sunday at midnight.
+- `@daily` = `0 0 * * *` - run once a day, at midnight.
+- `@hourly` = `0 * * * *` - on the hour, every hour.
diff --git a/sources/platform/storage/index.md b/sources/platform/storage/index.md
index 5682d4a8f7..41b742dcc1 100644
--- a/sources/platform/storage/index.md
+++ b/sources/platform/storage/index.md
@@ -16,7 +16,6 @@ import StoragePricingCalculator from "@site/src/components/StoragePricingCalcula
The Apify platform provides three types of storage accessible both within our [Apify Console](https://console.apify.com/storage) and externally through our [REST API](/api/v2), [Apify API Clients](/api), or [SDKs](/sdk).
-
-
diff --git a/sources/platform/storage/key_value_store.md b/sources/platform/storage/key_value_store.md
index cd85895d65..428c64a8fb 100644
--- a/sources/platform/storage/key_value_store.md
+++ b/sources/platform/storage/key_value_store.md
@@ -41,14 +41,12 @@ To view a key-value store's content, click on its **Store ID**.
Under the **Actions** menu, you can rename your store (and, in turn extend its [retention period](/platform/storage/usage#named-and-unnamed-storages)) and grant [access rights](../collaboration/index.md) using the **Share** button.
Click on the **API** button to view and test a store's [API endpoints](/api/v2/storage-key-value-stores).
-

On the bottom of the page, you can view, download, and delete the individual records.

-
### Apify API
The [Apify API](/api/v2/storage-key-value-stores) gives you programmatic access to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).
@@ -107,9 +105,7 @@ The Apify [JavaScript API client](/api/client/js/reference/class/KeyValueStoreCl
After importing and initiating the client, you can save each key-value store to a variable for easier access.
```js
-const myKeyValStoreClient = apifyClient.keyValueStore(
- 'jane-doe/my-key-val-store',
-);
+const myKeyValStoreClient = apifyClient.keyValueStore('jane-doe/my-key-val-store');
```
You can then use that variable to [access the key-value store's items and manage it](/api/client/js/reference/class/KeyValueStoreClient).
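
For instance, reading and writing individual records with that client could look roughly like this; a sketch using the client's `setRecord` and `getRecord` methods, with an illustrative key and value.

```js
// Store a JSON record under an illustrative key.
await myKeyValStoreClient.setRecord({
    key: 'crawl-state',
    value: { processedItems: 100 },
    contentType: 'application/json',
});

// Read it back; the returned record contains the key, value, and content type.
const record = await myKeyValStoreClient.getRecord('crawl-state');
console.log(record?.value);
```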
diff --git a/sources/platform/storage/request_queue.md b/sources/platform/storage/request_queue.md
index 32c0fa87b4..0c6e7e05f5 100644
--- a/sources/platform/storage/request_queue.md
+++ b/sources/platform/storage/request_queue.md
@@ -464,28 +464,21 @@ await requestQueueClient.batchAddRequests([
]);
// Locks the first two requests at the head of the queue.
-const processingRequestsClientOne = await requestQueueClient.listAndLockHead(
- {
- limit: 2,
- lockSecs: 120,
- },
-);
+const processingRequestsClientOne = await requestQueueClient.listAndLockHead({
+ limit: 2,
+ lockSecs: 120,
+});
// Checks when the lock will expire. The locked request will have a lockExpiresAt attribute.
const lockedRequest = processingRequestsClientOne.items[0];
-const lockedRequestDetail = await requestQueueClient.getRequest(
- lockedRequest.id,
-);
+const lockedRequestDetail = await requestQueueClient.getRequest(lockedRequest.id);
console.log(`Request locked until ${lockedRequestDetail?.lockExpiresAt}`);
// Prolongs the lock of the first request or unlocks it.
-await requestQueueClient.prolongRequestLock(
- lockedRequest.id,
- { lockSecs: 120 },
-);
-await requestQueueClient.deleteRequestLock(
- lockedRequest.id,
-);
+await requestQueueClient.prolongRequestLock(lockedRequest.id, {
+ lockSecs: 120,
+});
+await requestQueueClient.deleteRequestLock(lockedRequest.id);
await Actor.exit();
```
@@ -514,31 +507,32 @@ const requestQueueClient = client.requestQueue(requestQueue.id, {
// Get all requests from the queue and check one locked by the first Actor.
const requests = await requestQueueClient.listRequests();
-const requestsLockedByAnotherRun = requests.items.filter((request) => request.lockByClient === 'requestqueueone');
+const requestsLockedByAnotherRun = requests.items.filter(
+ (request) => request.lockByClient === 'requestqueueone',
+);
const requestLockedByAnotherRunDetail = await requestQueueClient.getRequest(
requestsLockedByAnotherRun[0].id,
);
// Other clients cannot list and lock these requests; the listAndLockHead call returns other requests from the queue.
-const processingRequestsClientTwo = await requestQueueClient.listAndLockHead(
- {
- limit: 10,
- lockSecs: 60,
- },
-);
+const processingRequestsClientTwo = await requestQueueClient.listAndLockHead({
+ limit: 10,
+ lockSecs: 60,
+});
const wasBothRunsLockedSameRequest = !!processingRequestsClientTwo.items.find(
(request) => request.id === requestLockedByAnotherRunDetail.id,
);
-console.log(`Was the request locked by the first run locked by the second run? ${wasBothRunsLockedSameRequest}`);
+console.log(
+ `Was the request locked by the first run locked by the second run? ${wasBothRunsLockedSameRequest}`,
+);
console.log(`Request locked until ${requestLockedByAnotherRunDetail?.lockExpiresAt}`);
// Other clients cannot modify the lock; attempting to do so will throw an error.
try {
- await requestQueueClient.prolongRequestLock(
- requestLockedByAnotherRunDetail.id,
- { lockSecs: 60 },
- );
+ await requestQueueClient.prolongRequestLock(requestLockedByAnotherRunDetail.id, {
+ lockSecs: 60,
+ });
} catch (err) {
// This will throw an error.
}
diff --git a/sources/platform/storage/usage.md b/sources/platform/storage/usage.md
index ac55550620..b44116d5fa 100644
--- a/sources/platform/storage/usage.md
+++ b/sources/platform/storage/usage.md
@@ -24,7 +24,6 @@ The [key-value store](./key_value_store.md) is ideal for saving data records suc

-
## Request queue
[Request queues](./request_queue.md) allow you to dynamically maintain a queue of URLs of web pages. You can use this when recursively crawling websites: you start from initial URLs and add new links as they are found while skipping duplicates.
@@ -35,10 +34,10 @@ The [key-value store](./key_value_store.md) is ideal for saving data records suc
You can access your storage in several ways:
-* [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface.
-* [Apify API](/api/v2/storage-key-value-stores) - to access your storages programmatically.
-* [API clients](/api) - to access your storages from any Node.js/Python application.
-* [Apify SDKs](/sdk) - when building your own JavaScript/Python Actor.
+- [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface.
+- [Apify API](/api/v2/storage-key-value-stores) - to access your storages programmatically.
+- [API clients](/api) - to access your storages from any Node.js/Python application.
+- [Apify SDKs](/sdk) - when building your own JavaScript/Python Actor.
### Apify Console
@@ -55,7 +54,9 @@ Additionally, you can quickly share the contents and details of your storage by

+
These URLs link to API _endpoints_—the places where your data is stored. Endpoints that allow you to _read_ stored information do not require an [authentication token](/api/v2#authentication). Calls are authenticated using a hard-to-guess ID, allowing for secure sharing. However, operations such as _update_ or _delete_ require the authentication token.
+
> Never share a URL containing your authentication token, to avoid compromising your account's security.
@@ -67,9 +68,9 @@ The [Apify API](/api/v2/storage-key-value-stores) allows you to access your stor
In most cases, when accessing your storages via API, you will need to provide a `store ID`, which you can do in the following formats:
-* `WkzbQMuFYuamGv3YF` - the store's alphanumerical ID if the store is unnamed.
-* `~store-name` - the store's name prefixed with tilde (`~`) character if the store is named (e.g. `~ecommerce-scraping-results`)
-* `username~store-name` - username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. `janedoe~ecommerce-scraping-results`). Note that in this case, the store's owner needs to grant you access first.
+- `WkzbQMuFYuamGv3YF` - the store's alphanumerical ID if the store is unnamed.
+- `~store-name` - the store's name prefixed with a tilde (`~`) character if the store is named (e.g. `~ecommerce-scraping-results`).
+- `username~store-name` - username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. `janedoe~ecommerce-scraping-results`). Note that in this case, the store's owner needs to grant you access first.
For read (GET) requests, it is enough to use a store's alphanumerical ID, since the ID is hard to guess and effectively serves as an authentication key.
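
To illustrate these formats, a read-only request for dataset items could look roughly like this; the IDs reuse the illustrative values from the list above, against the dataset items endpoint of the [Apify API](/api/v2).

```js
// Unnamed dataset, addressed by its alphanumerical ID; no token needed for reads.
const unnamed = await fetch('https://api.apify.com/v2/datasets/WkzbQMuFYuamGv3YF/items?format=json');

// Named dataset owned by another account, addressed as username~store-name
// (the owner must grant you access first).
const shared = await fetch(
    'https://api.apify.com/v2/datasets/janedoe~ecommerce-scraping-results/items?format=json',
);
console.log(await unnamed.json(), await shared.json());
```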
@@ -87,8 +88,8 @@ You can visit [API Clients](/api) documentations for more information.
The Apify SDKs are libraries in JavaScript or Python that provide tools for building your own Actors.
-* JavaScript SDK requires [Node.js](https://nodejs.org/en/) 16 or later.
-* Python SDK requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above.
+- JavaScript SDK requires [Node.js](https://nodejs.org/en/) 16 or later.
+- Python SDK requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above.
## Estimate your costs
@@ -109,12 +110,12 @@ Use this tool to estimate storage costs by plan and storage type.
All API endpoints limit their rate of requests to protect Apify servers from overloading. The default rate limit for storage objects is _60 requests per second_. However, there are exceptions limited to _400 requests per second_ per storage object, including:
-* [Push items](/api/v2/dataset-items-post) to dataset.
-* CRUD ([add](/api/v2/request-queue-requests-post),
-[get](/api/v2/request-queue-request-get),
-[update](/api/v2/request-queue-request-put),
-[delete](/api/v2/request-queue-request-delete))
-operations of _request queue_ requests.
+- [Push items](/api/v2/dataset-items-post) to dataset.
+- CRUD ([add](/api/v2/request-queue-requests-post),
+ [get](/api/v2/request-queue-request-get),
+ [update](/api/v2/request-queue-request-put),
+ [delete](/api/v2/request-queue-request-delete))
+ operations of _request queue_ requests.
If a client exceeds this limit, the API endpoints respond with the HTTP status code `429 Too Many Requests` and the following body:
@@ -145,8 +146,8 @@ To name your storage via API, get its ID from the run that generated it using th
Our SDKs and clients each have unique naming conventions for storages. For more information check out documentation:
-* [SDKs](/sdk)
-* [API Clients](/api)
+- [SDKs](/sdk)
+- [API Clients](/api)
## Named and unnamed storages
@@ -195,21 +196,21 @@ Learn how restricted access works in [General resource access](/platform/collabo
Named storages are only removed upon your request.
You can delete storages in the following ways:
-* [Apify Console](https://console.apify.com/storage) - using the **Actions** button in the store's detail page.
-* [JavaScript SDK](/sdk/js) - using the `.drop()` method of the
+- [Apify Console](https://console.apify.com/storage) - using the **Actions** button in the store's detail page.
+- [JavaScript SDK](/sdk/js) - using the `.drop()` method of the
[Dataset](/sdk/js/api/apify/class/Dataset#drop),
[Key-value store](/sdk/js/api/apify/class/KeyValueStore#drop),
or [Request queue](/sdk/js/api/apify/class/RequestQueue#drop) class.
-* [Python SDK](/sdk/python) - using the `.drop()` method of the
+- [Python SDK](/sdk/python) - using the `.drop()` method of the
[Dataset](/sdk/python/reference/class/Dataset#drop),
[Key-value store](/sdk/python/reference/class/KeyValueStore#drop),
or [Request queue](/sdk/python/reference/class/RequestQueue#drop) class.
-* [JavaScript API client](/api/client/js) - using the `.delete()` method in the
+- [JavaScript API client](/api/client/js) - using the `.delete()` method in the
[dataset](/api/client/js/reference/class/DatasetClient),
[key-value store](/api/client/js/reference/class/KeyValueStoreClient),
or [request queue](/api/client/js/reference/class/RequestQueueClient) clients.
-* [Python API client](/api/client/python) - using the `.delete()` method in the
+- [Python API client](/api/client/python) - using the `.delete()` method in the
[dataset](/api/client/python#datasetclient),
[key-value store](/api/client/python/reference/class/KeyValueStoreClient),
or [request queue](/api/client/python/reference/class/RequestQueueClient) clients.
-* [API](/api/v2/key-value-store-delete) using the - `Delete [store]` endpoint, where `[store]` is the type of storage you want to delete.
+- [API](/api/v2/key-value-store-delete) - using the `Delete [store]` endpoint, where `[store]` is the type of storage you want to delete.
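
For example, dropping a named dataset via the JavaScript SDK route above could look roughly like this; the dataset name is illustrative.

```js
import { Actor } from 'apify';

await Actor.init();

// Open the named dataset and delete it together with all of its data.
const dataset = await Actor.openDataset('old-scraping-results');
await dataset.drop();

await Actor.exit();
```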