Skip to content

Conversation

@mosharaf13
Copy link
Contributor

closes #21

What happened 👀

  • Added backend mechanism for scrapping data from google

Insight 📝

  • Added a job that takes a search_stat id as input and after scrapping data from google, updates relevant columns for it

Proof Of Work 📹

Testsuite run stat Screen Shot 2023-06-12 at 4 05 52 PM

…ge' of github.com:mosharaf13/google-scrapper-ruby into backend/scrap-google-search-page
@mosharaf13 mosharaf13 added this to the 0.5.0 milestone Jun 12, 2023
@mosharaf13 mosharaf13 self-assigned this Jun 12, 2023
@mosharaf13 mosharaf13 requested a review from a user June 12, 2023 09:13
@github-actions
Copy link

github-actions bot commented Jun 12, 2023

Code coverage is now at 0.00% (0/204 lines)

Generated by 🚫 Danger

Copy link
Collaborator

@sanG-github sanG-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early suggestions 😉 Please take a look at your code to re-format it, I saw some empty lines in the service. 🙇
Hope that is useful for you.

class SearchKeywordJob < ApplicationJob
queue_as :default

def perform(search_stat_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should better make the call explicit with the keyword arguments.

Suggested change
def perform(search_stat_id)
def perform(search_stat_id:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 45c5304

end

def ads_page_urls
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to return true/false.

Suggested change
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].present? }

Copy link
Contributor Author

@mosharaf13 mosharaf13 Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverting back to presence here 57afc7e as present? returns a collection of only true values, not actual urls

Screen Shot 2023-06-22 at 11 02 51 AM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, we can use filter with present?,

Suggested change
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? }

status { rand(1..3) }
raw_response { FFaker::HTMLIpsum.body }
user_id { demo_user.id }
user { User.create(email: 'user@demo.com', password: 'Secret@11') }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we do it explicitly as previously? 💭

Suggested change
user { User.create(email: 'user@demo.com', password: 'Secret@11') }
user { demo_user }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating explicit user throws below exception while running tests. However, implicit creation doesn't throw any error. If it's required to use explicit declaration, we will have to find other ways to solve the problem

Screen Shot 2023-06-13 at 2 34 29 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's specify the class name explicitly

Fabricator(:search_stat, class_name: SearchStat) do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to use a specific user for the search_stat, how about generating randomly, e.g.

Fabricator(:search_stat, class_name: SearchStat) do
  ...
  user { Fabricate(:user) }
end

when you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this

john = User.find(1)

search_stat = Fabricate(:search_stat, user: john)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, I thought we were in the seeds file. How about using Fabricate instead of creating a new one manually?

Suggested change
user { User.create(email: 'user@demo.com', password: 'Secret@11') }
user { Fabricate(:user) }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5c5eaad


html_result = Google::ClientService.new(keyword: search_stat.keyword).call

raise ClientServiceError unless html_result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does NOT make total sense as the returned data can be blank, right? 🙊

IMO, I think we can separate it into a method, and then use the begin/rescue to catch the errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 122de2f

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosharaf13 Don't know why the changes don't reflect here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure about that. This is how the job looks currently.

Copy link
Collaborator

@sanG-github sanG-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions, please update tests for the missing service.

def perform(search_stat_id:)
search_stat = SearchStat.find search_stat_id
html_result = fetch_html_result(search_stat.keyword)
update_search_stat search_stat, ParserService.new(html_response: html_result).call
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more explicit and readable.
First, we call the service and assign the returned data to the variable, then, we pass the variable to the update method.

Suggested change
update_search_stat search_stat, ParserService.new(html_response: html_result).call
parsed_attributes = ParserService.new(html_response: html_result).call
update_search_stat(search_stat, parsed_attributes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 40d3740

queue_as :default

def perform(search_stat_id:)
search_stat = SearchStat.find search_stat_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the SearchStat is not found? 🤔 (Maybe it was deleted before the job ran)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a possibility 🤔? Should this case be handled?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure Mosharaf, we must handle this case. 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b1fde06

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the fix, it will never reach the line return unless search_stat when there is no search_stat.
You can take a look at the difference between find and find_by.

Comment on lines 16 to 17
Rails.logger.error("Error while fetching HTML result: #{e.message}")
raise ClientServiceError, 'Error fetching HTML result'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never reach that codes as we already rescue the errors in the ClientService and then return false.
Otherwise, if there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, should this section be removed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is another error that is out of HTTParty::Error, Timeout::Error, SocketError, we can also catch it in the ClientService.

We can remove it in this service, but please handle it in the ClientService alternately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 582a136

Comment on lines 22 to 24
# rubocop:disable Rails/SkipsModelValidations
search_stat.result_links.insert_all attributes[:result_links]
# rubocop:enable Rails/SkipsModelValidations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to disable this rule, usually we find a way to work around this instead of disabling the rule each time we faced it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should better use create instead of insert_all, then we won't miss any validations as we don't want to create a record with an url.

validates :url, presence: true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8051e7c

# Inspect Http response status code
# Any non 200 response code will be logged
def valid_result?(result)
return true if result&.response&.code == '200'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can early return false if there is no result, so we don't need the safe navigator anymore. ✨

Suggested change
return true if result&.response&.code == '200'
return false unless result
return true if result.response.code == SUCCESS_STATUS_CODE

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please define a constant for SUCCESS_STATUS_CODE = '200'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bce3b3e

def valid_result?(result)
return true if result&.response&.code == '200'

Rails.logger.warn "Warning: Query Google with '#{@escaped_keyword}' return status code #{result.response.code}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not belongs to the valid_result? method, we only expect it returns the valid? property of the result, not to write the log. (We can move it to the call method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bce3b3e

Comment on lines +6 to +7
AD_CONTAINER_ID = 'tads'
ADWORDS_CLASS = 'adwords'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that if we use ID or CLASS, we need to manually add a corresponding prefix # or . for it.
So, what do you think about defining it as a SELECTOR as you have done with NON_ADS_RESULT_SELECTOR

Suggested change
AD_CONTAINER_ID = 'tads'
ADWORDS_CLASS = 'adwords'
AD_CONTAINER_ID = '#tads'
ADWORDS_CLASS = '.adwords'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      # Add a class to all AdWords link for easier manipulation
      document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)

ADWORDS_CLASS is being added for easier manipulation. Adding . to this constant would result in stripping . before using it in this line.

Comment on lines 10 to 20
raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank?

@html = html_response

@document = Nokogiri::HTML.parse(html_response)

# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)

# Mark footer links to identify them
document.css('#footcnt a').add_class('footer-links')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this scope, we should better not raise any errors or customize the document, just initialize some values that we need before moving on.

Suggested change
raise ArgumentError, 'response.body cannot be blank' if html_response.body.blank?
@html = html_response
@document = Nokogiri::HTML.parse(html_response)
# Add a class to all AdWords link for easier manipulation
document.css('div[data-text-ad] a[data-ved]').add_class(ADWORDS_CLASS)
# Mark footer links to identify them
document.css('#footcnt a').add_class('footer-links')
@html = html_response
@document = Nokogiri::HTML.parse(html_response) if html_response.body

end

# Parse html data and return a hash with the results
def call
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then in this method, we will early return if it does not pass the valid? check.

Suggested change
def call
def call
return unless valid?
def valid?
  html.present? && document.present?
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7dda9bb. Did a bit of refactoring based on your suggestions. Please let me know if this covers everything.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bd1df63

status { rand(1..3) }
raw_response { FFaker::HTMLIpsum.body }
user_id { demo_user.id }
user { User.create(email: 'user@demo.com', password: 'Secret@11') }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's specify the class name explicitly

Fabricator(:search_stat, class_name: SearchStat) do

status { rand(1..3) }
raw_response { FFaker::HTMLIpsum.body }
user_id { demo_user.id }
user { User.create(email: 'user@demo.com', password: 'Secret@11') }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need to use a specific user for the search_stat, how about generating randomly, e.g.

Fabricator(:search_stat, class_name: SearchStat) do
  ...
  user { Fabricate(:user) }
end

when you need to use the fabricator with a specific user, e.g. in the Seed file, it can be done like this

john = User.find(1)

search_stat = Fabricate(:search_stat, user: john)

@uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}")
end

def call
Copy link

@github-actions github-actions bot Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has approx 6 statements

return false unless valid_result? result

result
rescue HTTParty::Error, Timeout::Error, SocketError => e
Copy link

@github-actions github-actions bot Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has the variable name 'e'


private

def valid_result?(result)
Copy link

@github-actions github-actions bot Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)


def fetch_html_result(keyword)
Google::ClientService.new(keyword: keyword).call
rescue StandardError => e

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has the variable name 'e'

@mosharaf13 mosharaf13 requested review from longnd and sanG-github June 22, 2023 09:36
Base automatically changed from ui/list-keywords to develop June 23, 2023 03:42
Copy link
Collaborator

@sanG-github sanG-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help to resolve all the remaining comments, and add the missing tests (client_service.rb)

queue_as :default

def perform(search_stat_id:)
search_stat = SearchStat.find search_stat_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the fix, it will never reach the line return unless search_stat when there is no search_stat.
You can take a look at the difference between find and find_by.


html_result = Google::ClientService.new(keyword: search_stat.keyword).call

raise ClientServiceError unless html_result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mosharaf13 Don't know why the changes don't reflect here.


update_search_stat(search_stat, parsed_attributes)
rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError
update_keyword_status search_stat, :failed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, what happens if there is an error that is out of these errors? 🤔 Do we need to roll it back too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ced1a0e Refactored search_keyword_job and client_service as ClientServiceError exception wasn't handled properly.

We can try wrapping up parser_service call method in a rescue and handle exception here. In that case, too many params in rescue method.

current


rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid


rescue ActiveRecord::RecordNotFound, ClientServiceError, ArgumentError, ActiveRecord::RecordInvalid, ParserServiceError

What we keep this many params or should we create a generic exception class to handle these?

end

def ads_page_urls
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that, we can use filter with present?,

Suggested change
document.css(".#{ADWORDS_CLASS}").filter_map { |a_tag| a_tag['href'].presence }
document.css(".#{ADWORDS_CLASS}").filter { |a_tag| a_tag['href'].present? }

end

# Parse html data and return a hash with the results
def call
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sort the order of methods. We usually put valid?, mark_adword_links, mark_footer_links, present_parsed_data on top of other ones.

# Schedule the SearchKeywordJob for background processing
Google::SearchKeywordJob.perform_later(search_stat_id: 1)

puts 'SearchKeywordJob scheduled successfully.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curiosity why do need this rake task? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to make present? work with filter. But I keep getting an error (attached).

Here, if I use filter, I have to chain another map method to return only 'href' values. Would it be worth it 😬 ?

Also, if I use select-pluck combo, would that be okay 😀 ?

Here, if present? doesn't provide significant performance improvement, wouldn't filter_map be with presence be better in that case? considering the conciseness 🤔

Screen Shot 2023-06-29 at 1 10 14 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curiosity why do need this rake task? 🤔

Removing 👍🏼

@mosharaf13 mosharaf13 requested a review from sanG-github June 29, 2023 04:09
@uri = URI("#{BASE_SEARCH_URL}?q=#{@escaped_keyword}&hl=#{lang}&gl=#{lang}")
end

def call
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has approx 6 statements

return false unless valid_result? result

result
rescue HTTParty::Error, Timeout::Error, SocketError => e
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has the variable name 'e'

update_search_stat_status search_stat, :failed
end

def update_search_stat(search_stat, attributes)
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

end
end

def update_search_stat_status(search_stat, status)
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

raise ClientServiceError unless valid_result? result

result
rescue HTTParty::Error, Timeout::Error, SocketError, ClientServiceError => e
Copy link

@github-actions github-actions bot Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Has the variable name 'e'

results
end

def result_link_map(urls, type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Doesn't depend on instance state (maybe move it to another class?)

@sanG-github sanG-github removed their request for review September 26, 2023 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants