
Fix video url scraping#285

Open
makamys wants to merge 1 commit into taspinar:master from makamys:master



makamys commented Apr 21, 2020

The HTML element that the video URL was scraped from no longer exists, so video_div.find('a') returned None, which caused scraping to fail for tweets containing videos.
I changed it to extract the video ID with a regex and construct the video URL from it.
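
For readers following along, here is a minimal sketch of the regex approach described above. It is not the PR's actual diff. The function name `extract_video_url` is hypothetical, the thumbnail pattern is the `tweet_video_thumb/<VIDEO ID>.jpg` form mentioned later in this thread, and the `video.twimg.com/tweet_video/<VIDEO ID>.mp4` layout of the resulting URL is an assumption.

```python
import re
from typing import Optional

# Thumbnail pattern for short videos, as described in this thread.
# The character class stops at quotes or slashes so the match stays inside one URL.
THUMB_RE = re.compile(r"https://pbs\.twimg\.com/tweet_video_thumb/([^/\"']+)\.jpg")


def extract_video_url(video_div_html: str) -> Optional[str]:
    """Return a guessed video URL, or None if no short-video thumbnail is found."""
    match = THUMB_RE.search(video_div_html)
    if match is None:
        # Checking for None explicitly avoids the AttributeError that an
        # unconditional .group(1) would raise on a non-matching element.
        return None
    video_id = match.group(1)
    # Assumption: short videos are served from this path; not confirmed by the source.
    return f"https://video.twimg.com/tweet_video/{video_id}.mp4"
```

Returning None instead of raising lets the caller decide whether to skip the video or keep the tweet without a video URL.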

@someguy-2020

I had to change line 83 to:
video_id = re.search(r"https://pbs\.twimg\.com/ext_tw_video_thumb/(.*)\.jpg", str(video_div)).group(1)
[tweet_video_thumb --> ext_tw_video_thumb]
to get the proper video image URL. Unfortunately, this doesn't provide the proper video_url. Any idea how the video_url can be derived from the video image URL?


makamys commented Apr 24, 2020

Oh dang, it looks like it wasn't as simple as I was hoping. It turns out short videos have the thumbnail image in a format like tweet_video_thumb/<VIDEO ID>.jpg, and for those, my code works.

But longer videos use the format ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg, like the one you posted. Those videos are streamed via HLS, and the web app makes an API call (https://api.twitter.com/1.1/videos/tweet/config/<TWEET ID>.json) to find the m3u8 playlist that lists the segments (its URL has the form https://video.twimg.com/ext_tw_video/<TWEET ID>/pu/pl/<VIDEO ID>.m3u8).

Using <THUMBNAIL ID> as the <VIDEO ID> doesn't work though, and there's no reference to the <VIDEO ID> in the HTML served. So there may not be a way to get the video URL without making an API call.
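
The two thumbnail formats above could be told apart roughly as follows. This is only a sketch under the assumptions stated in this comment: the names `VideoThumb`, `classify_thumbnail`, and `config_url` are hypothetical, and the code only builds the config endpoint URL without calling it, since that request needs authorization.

```python
import re
from typing import NamedTuple, Optional

# Short videos: tweet_video_thumb/<VIDEO ID>.jpg
SHORT_RE = re.compile(r"pbs\.twimg\.com/tweet_video_thumb/([^/\"']+)\.jpg")
# Long (HLS) videos: ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg
LONG_RE = re.compile(r"pbs\.twimg\.com/ext_tw_video_thumb/([^/\"']+)/pu/img/([^/\"']+)\.jpg")


class VideoThumb(NamedTuple):
    kind: str                 # "short" or "long"
    video_id: Optional[str]   # only known for short videos
    tweet_id: Optional[str]   # only known for long videos


def classify_thumbnail(html: str) -> Optional[VideoThumb]:
    """Classify the first video thumbnail URL found in the given HTML."""
    long_match = LONG_RE.search(html)
    if long_match:
        # The thumbnail ID (group 2) is deliberately ignored: per the
        # discussion, it cannot be used as the video ID.
        return VideoThumb("long", None, long_match.group(1))
    short_match = SHORT_RE.search(html)
    if short_match:
        return VideoThumb("short", short_match.group(1), None)
    return None


def config_url(tweet_id: str) -> str:
    """Build the config endpoint URL quoted above (the request itself is not made here)."""
    return f"https://api.twitter.com/1.1/videos/tweet/config/{tweet_id}.json"
```

For a long video, `config_url(thumb.tweet_id)` gives the endpoint that would have to be queried to learn the m3u8 location.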

By the way, youtube-dl uses the API with a guest token to get the video url (see twitter.py, relevant discussion here).


As a workaround, the video URL could be set to the tweet's URL so that tweets with videos at least don't get skipped. My use case for twitterscraper didn't include scraping tweets with long videos though, so I won't be fixing this myself, but hopefully these notes will be useful to someone else.
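
A rough sketch of that workaround, for anyone picking this up. The function name `resolve_video_url` is hypothetical and the short-video URL layout is an assumption, not something confirmed in this thread.

```python
import re

# Short-video thumbnail pattern, as described earlier in this comment.
SHORT_THUMB_RE = re.compile(r"pbs\.twimg\.com/tweet_video_thumb/([^/\"']+)\.jpg")


def resolve_video_url(video_div_html: str, tweet_url: str) -> str:
    """Return a direct video URL when one can be guessed, else the tweet's own URL."""
    match = SHORT_THUMB_RE.search(video_div_html)
    if match:
        # Assumption: short videos are served from this path.
        return f"https://video.twimg.com/tweet_video/{match.group(1)}.mp4"
    # Long (HLS) videos or unknown markup: fall back to the tweet URL so the
    # tweet is still scraped instead of being skipped.
    return tweet_url
```

With this shape the scraper always gets a string back, so a tweet with a long video is kept and simply points at itself.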
