Web scraping with requests-html behind a proxy/VPN

Hi,
I have built a web scraper to scrape data from JavaScript-filled fields and tables. I used requests-html for this because it was much faster than using Selenium with headless Chrome, and it still supports rendering JavaScript data.

Now I want to put it behind some kind of proxy or VPN. I would prefer a SOCKS5 setup, since I assume it would be faster, but I honestly have no idea how to set this up.

My code to get the rendered HTML currently looks like this:

from requests_html import HTMLSession

def scrape(URL):
     session = HTMLSession()
     resp = session.get(URL)
     # render() runs the page's JavaScript in headless Chromium and updates resp.html in place
     resp.html.render(timeout=30)
     session.close()
     return resp.html.html

I am posting here hoping someone has experience with this kind of thing and can point me in the right direction. Even if you don't have direct experience, an educated guess would be welcome.

If you're using a VPN, you should not need to do anything other than connect to it. All traffic on the computer/server where the script is running will be routed through the VPN tunnel.

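If you're using a proxy, you should just need to pass a dictionary that maps each URL scheme ("http", "https") to the proxy URL. SOCKS proxies also need the extra dependency for requests (pip install requests[socks]).

from requests_html import HTMLSession
def scrape(URL):
     # Keys are the scheme of the target URL, not the proxy type
     my_proxy = {
          "http": "socks5://username:password@address:port/",
          "https": "socks5://username:password@address:port/",
     }
     session = HTMLSession()
     resp = session.get(URL, proxies=my_proxy)
     resp.html.render(timeout=30)
     session.close()
     return resp.html.html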
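
If the username or password contains characters such as "@" or ":", the proxy URL has to be percent-encoded or it will not parse correctly. Here is a minimal sketch of a helper that builds the dictionary; the function name and the example host are made up for illustration, and it assumes a single SOCKS5 proxy for both HTTP and HTTPS traffic.

```python
from urllib.parse import quote


def build_socks5_proxies(host, port, username=None, password=None):
    """Return a requests-style proxies dict for one SOCKS5 proxy (hypothetical helper)."""
    credentials = ""
    if username is not None:
        # Percent-encode so characters such as "@" or ":" do not break the URL.
        credentials = quote(username, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    proxy_url = f"socks5://{credentials}{host}:{port}"
    # requests chooses the proxy by the scheme of the target URL, so set both keys.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # proxy.example.com is a placeholder, not a real proxy.
    print(build_socks5_proxies("proxy.example.com", 1080, "user", "p@ss"))
```

The result can be passed straight to session.get(URL, proxies=...). Swapping the scheme to socks5h:// should make requests resolve DNS through the proxy as well.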

session = HTMLSession(browser_args=["--proxy-server=64.227.34.111:3128"])

That works, but it does not support authenticated proxies :frowning:
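
For an unauthenticated proxy, a rough sketch of covering both code paths would be to hand the same address to get() through the proxies dict and to Chromium through browser_args. The helper names are hypothetical. The sketch assumes a plain HTTP proxy when no scheme is given, and it keeps "--no-sandbox" because that appears to be the library's default value for browser_args, which a custom list would otherwise replace.

```python
def chromium_proxy_args(proxy_address):
    """Build browser_args for HTMLSession so Chromium uses the proxy (hypothetical helper)."""
    if "@" in proxy_address:
        # The --proxy-server flag does not accept embedded credentials.
        raise ValueError("Chromium's --proxy-server flag does not take credentials")
    return ["--no-sandbox", f"--proxy-server={proxy_address}"]


def requests_proxies(proxy_address):
    """Build the proxies dict for session.get() from the same address."""
    # Assumption: treat a bare host:port as a plain HTTP proxy.
    proxy_url = proxy_address if "://" in proxy_address else f"http://{proxy_address}"
    return {"http": proxy_url, "https": proxy_url}


def scrape_via_proxy(url, proxy_address):
    """Fetch and render a page with both get() and render() pointed at the proxy."""
    from requests_html import HTMLSession  # third-party, imported when actually scraping

    session = HTMLSession(browser_args=chromium_proxy_args(proxy_address))
    try:
        resp = session.get(url, proxies=requests_proxies(proxy_address))
        resp.html.render(timeout=30)
        return resp.html.html
    finally:
        session.close()
```

Calling scrape_via_proxy(URL, "64.227.34.111:3128") would then replace the original scrape(URL).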

So my solution will be a virtual machine with a VPN installed.

A VPN is the solution for now, but the goal is to have a proxy dict like in your example.
My concern is that the render() method makes its own network requests, so only the get() method would use the proxy.

How would I test this?
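
One possible way to check, as a sketch only: request a page that echoes the caller's IP address, once through get() and once after render(), and compare what each one reports. This assumes https://httpbin.org/ip (or any similar echo service) is reachable and that render() reloads the page in Chromium, which I believe is its default behaviour. The function names are made up.

```python
import re

# Assumption: any service that echoes the caller's public IP would work here.
IP_ECHO_URL = "https://httpbin.org/ip"

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text):
    """Return the set of IPv4-looking strings found in text."""
    return set(IPV4_PATTERN.findall(text))


def compare_ips(proxies=None, browser_args=None):
    """Return (IPs seen by get(), IPs seen after render()) for the echo page."""
    from requests_html import HTMLSession  # third-party, imported when actually testing

    session = HTMLSession(browser_args=browser_args or ["--no-sandbox"])
    try:
        resp = session.get(IP_ECHO_URL, proxies=proxies)
        ips_from_get = extract_ips(resp.text)
        resp.html.render(timeout=30)  # Chromium loads the page itself here
        ips_from_render = extract_ips(resp.html.html)
    finally:
        session.close()
    return ips_from_get, ips_from_render
```

If the first set shows the proxy's address and the second shows your own, render() is bypassing the proxy.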

I don't know whether that issue has been fixed, but there seems to be a solution on requests-html's GitHub page.
solution

Thanks, I will test it out later :slight_smile: