-
Notifications
You must be signed in to change notification settings - Fork 0
[#21] [Backend] As a system, I can scrap data from Google search result page #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…ge' of github.com:mosharaf13/google-scrapper-ruby into backend/scrap-google-search-page
|
Code coverage is now at 0.00% (0/204 lines) Generated by 🚫 Danger |
sanG-github
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some early suggestions 😉 Please take a look at your code to re-format it, I saw some empty lines in the service. 🙇
Hope that is useful for you.
| class SearchKeywordJob < ApplicationJob | ||
| queue_as :default | ||
|
|
||
| def perform(search_stat_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should better make the call explicit with the keyword arguments.
| def perform(search_stat_id) | |
| def perform(search_stat_id:) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 45c5304
| end | ||
|
|
||
| def ads_page_urls | ||
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to return true/false.
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].present? } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverting back to presence here 57afc7e as present? returns a collection of only true values, not actual urls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, we can use filter with present?,
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? } |
| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we do it explicitly as previously? 💭
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } | |
| user { demo_user } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's specify the class name explicitly
Fabricator(:search_stat, class_name: SearchStat) doThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to use a specific user for the search_stat, how about generating randomly, e.g.
Fabricator(:search_stat, class_name: SearchStat) do
...
user { Fabricate(:user) }
endwhen you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this
john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, I thought we were in the seeds file. How about using Fabricate instead of creating a new one manually?
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } | |
| user { Fabricate(:user) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 5c5eaad
|
|
||
| html_result = Google::ClientService.new(keyword: search_stat.keyword).call | ||
|
|
||
| raise ClientServiceError unless html_result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does NOT make total sense as the returned data can be blank, right? 🙊
IMO, I think we can separate it into a method, and then use the begin/rescue to catch the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 122de2f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mosharaf13 Don't know why the changes don't reflect here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure about that. This is how the job looks currently.
…per-ruby into backend/scrap-google-search-page
Co-authored-by: Sang Huynh Thanh <63148598+sanG-github@users.noreply.github.com>
sanG-github
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions, please update tests for the missing service.
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id | ||
| html_result = fetch_html_result(search_stat.keyword) | ||
| update_search_stat search_stat, ParserService.new(html_response: html_result).call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more explicit and readable.
First, we call the service and assign the returned data to the variable, then, we pass the variable to the update method.
| update_search_stat search_stat, ParserService.new(html_response: html_result).call | |
| parsed_attributes = ParserService.new(html_response: html_result).call | |
| update_search_stat(search_stat, parsed_attributes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 40d3740
| queue_as :default | ||
|
|
||
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a possibility 🤔? Should this case be handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure Mosharaf, we must handle this case. 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in b1fde06
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Rails.logger.error("Error while fetching HTML result: #{e.message}") | ||
| raise ClientServiceError, 'Error fetching HTML result' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never reach that codes as we already rescue the errors in the ClientService and then return false.
Otherwise, if there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, should this section be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.
We can remove it in this service, but please handle it in the ClientService alternately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 582a136
| # rubocop:disable Rails/SkipsModelValidations | ||
| search_stat.result_links.insert_all attributes[:result_links] | ||
| # rubocop:enable Rails/SkipsModelValidations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to disable this rule, usually we find a way to work around this instead of disabling the rule each time we faced it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should better use create instead of insert_all, then we won't miss any validations as we don't want to create a record with an url.
validates :url, presence: trueThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 8051e7c
| # Inspect Http response status code | ||
| # Any non 200 response code will be logged | ||
| def valid_result?(result) | ||
| return true if result&.response&.code == '200' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can early return false if there is no result, so we don't need the safe navigator anymore. ✨
| return true if result&.response&.code == '200' | |
| return false unless result | |
| return true if result.response.code == SUCCESS_STATUS_CODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please define a constant for SUCCESS_STATUS_CODE = '200'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in bce3b3e
| def valid_result?(result) | ||
| return true if result&.response&.code == '200' | ||
|
|
||
| Rails.logger.warn "Warning: Query Google with '#{@escaped_keyword}' return status code #{result.response.code}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not belongs to the valid_result? method, we only expect it returns the valid? property of the result, not to write the log. (We can move it to the call method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in bce3b3e
| AD_CONTAINER_ID = 'tads' | ||
| ADWORDS_CLASS = 'adwords' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that if we use ID or CLASS, we need to manually add a corresponding prefix # or . for it.
So, what do you think about defining it as a SELECTOR as you have done with NON_ADS_RESULT_SELECTOR
| AD_CONTAINER_ID = 'tads' | |
| ADWORDS_CLASS = 'adwords' | |
| AD_CONTAINER_ID = '#tads' | |
| ADWORDS_CLASS = '.adwords' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
ADWORDS_CLASS is being added for easier manipulation. Adding . to this constant would result in stripping . before using it in this line.
| raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank? | ||
|
|
||
| @html = html_response | ||
|
|
||
| @document = Nokogiri::HTML.parse(html_response) | ||
|
|
||
| # Add a class to all AdWords link for easier manipulation | ||
| document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS) | ||
|
|
||
| # Mark footer links to identify them | ||
| document.css('#footcnt a').add_class('footer-links') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this scope, we should better not raise any errors or customize the document, just initialize some values that we need before moving on.
| raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank? | |
| @html = html_response | |
| @document = Nokogiri::HTML.parse(html_response) | |
| # Add a class to all AdWords link for easier manipulation | |
| document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS) | |
| # Mark footer links to identify them | |
| document.css('#footcnt a').add_class('footer-links') | |
| @html = html_response | |
| @document = Nokogiri::HTML.parse(html_response) if html_response.body |
| end | ||
|
|
||
| # Parse html data and return a hash with the results | ||
| def call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then in this method, we will early return if it does not pass the valid? check.
| def call | |
| def call | |
| return unless valid? | |
def valid?
html.present? && document.present?
endThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7dda9bb. Did a bit of refactoring based on your suggestions. Please let me know if this covers everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in bd1df63
| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's specify the class name explicitly
Fabricator(:search_stat, class_name: SearchStat) do| status { rand(1..3) } | ||
| raw_response { FFaker::HTMLIpsum.body } | ||
| user_id { demo_user.id } | ||
| user { User.create(email: 'user@demo.com', password: 'Secret@11') } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to use a specific user for the search_stat, how about generating randomly, e.g.
Fabricator(:search_stat, class_name: SearchStat) do
...
user { Fabricate(:user) }
endwhen you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this
john = User.find(1)
search_stat = Fabricate(:search_stat, user: john)| @uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}") | ||
| end | ||
|
|
||
| def call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has approx 6 statements |
| return false unless valid_result? result | ||
|
|
||
| result | ||
| rescue HTTParty::Error, Timeout::Error, SocketError => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has the variable name 'e' |
|
|
||
| private | ||
|
|
||
| def valid_result?(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Doesn't depend on instance state (maybe move it to another class?) |
|
|
||
| def fetch_html_result(keyword) | ||
| Google::ClientService.new(keyword: keyword).call | ||
| rescue StandardError => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has the variable name 'e' |
sanG-github
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help to resolve all the remaining comments, and add the missing tests (client_service.rb)
| queue_as :default | ||
|
|
||
| def perform(search_stat_id:) | ||
| search_stat = SearchStat.find search_stat_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| html_result = Google::ClientService.new(keyword: search_stat.keyword).call | ||
|
|
||
| raise ClientServiceError unless html_result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mosharaf13 Don't know why the changes don't reflect here.
|
|
||
| update_search_stat(search_stat, parsed_attributes) | ||
| rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError | ||
| update_keyword_status search_stat, :failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover, what happens if there is an error that is out of these errors? 🤔 Do we need to roll it back too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ced1a0e Refactored search_keyword_job and client_service as ClientServiceError exception wasn't handled properly.
We can try wrapping up parser_service call method in a rescue and handle exception here. In that case, too many params in rescue method.
current
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid, ParserServiceError
What we keep this many params or should we create a generic exception class to handle these?
| end | ||
|
|
||
| def ads_page_urls | ||
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that, we can use filter with present?,
| document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence } | |
| document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? } |
| end | ||
|
|
||
| # Parse html data and return a hash with the results | ||
| def call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.
lib/tasks/search_keyword.rake
Outdated
| # Schedule the SearchKeywordJob for background processing | ||
| Google::SearchKeywordJob.perform_later(search_stat_id: 1) | ||
|
|
||
| puts 'SearchKeywordJob scheduled successfully.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curiosity why do need this rake task? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to make present? work with filter. But I keep getting an error (attached).
Here, if I use filter, I have to chain another map method to return only 'href' values. Would it be worth it 😬 ?
Also, if I use select-pluck combo, would that be okay 😀 ?
Here, if present? doesn't provide significant performance improvement, wouldn't filter_map be with presence be better in that case? considering the conciseness 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curiosity why do need this rake task? 🤔
Removing 👍🏼
…f13/google-scrapper-ruby into backend/scrap-google-search-page
| @uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}") | ||
| end | ||
|
|
||
| def call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has approx 6 statements |
| return false unless valid_result? result | ||
|
|
||
| result | ||
| rescue HTTParty::Error, Timeout::Error, SocketError => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has the variable name 'e' |
| update_search_stat_status search_stat, :failed | ||
| end | ||
|
|
||
| def update_search_stat(search_stat, attributes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Doesn't depend on instance state (maybe move it to another class?) |
| end | ||
| end | ||
|
|
||
| def update_search_stat_status(search_stat, status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Doesn't depend on instance state (maybe move it to another class?) |
| raise ClientServiceError unless valid_result? result | ||
|
|
||
| result | ||
| rescue HTTParty::Error, Timeout::Error, SocketError, ClientServiceError => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Has the variable name 'e' |
| results | ||
| end | ||
|
|
||
| def result_link_map(urls, type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Doesn't depend on instance state (maybe move it to another class?) |
closes #21
What happened 👀
Insight 📝
Proof Of Work 📹
Testsuite run stat