Hi,
I have built a web scraper to scrape data from JavaScript-filled fields and tables. I used requests-html for this because it was much faster than Selenium/headless Chrome, and it still supports rendering JavaScript content.
Now I want to put it behind some kind of proxy/VPN. I would prefer a SOCKS5 setup, thinking this would be faster, but I honestly have no idea how to set it up.
My code to get the rendered html looks like this at the moment…
```python
def scrape(URL):
    from requests_html import HTMLSession

    session = HTMLSession()
    resp = session.get(URL)
    resp.html.render(timeout=30)
    session.close()
    return resp.html.html
```
I post here hoping someone has experience with this kind of thing and can point me in the right direction. Even if you don’t have direct experience, a qualified guess would be welcome as well.
If you’re using a VPN, you should not need to do anything other than connect to it; all the traffic on the computer/server where the script is running will be routed through the VPN tunnel.
If you’re using a proxy, you should just need to add a dictionary containing the address and type of proxy.
```python
from requests_html import HTMLSession

def scrape(URL):
    # Keys in the proxies dict are the scheme of the target URL.
    # For socks5:// proxy URLs, requests needs the requests[socks]
    # extra (PySocks) installed.
    my_proxy = {
        "http": "socks5://username:password@address:port/",
        "https": "socks5://username:password@address:port/",
    }
    session = HTMLSession()
    resp = session.get(URL, proxies=my_proxy)
    resp.html.render(timeout=30)
    session.close()
    return resp.html.html
```
```python
session = HTMLSession(browser_args=["--proxy-server=64.227.34.111:3128"])
```

^^That works, but it does not support an authenticated proxy.
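Putting the two approaches together, a sketch of a version where both `get()` and `render()` go through the same proxy might look like this. This assumes an unauthenticated HTTP proxy (reusing the example address above), since `--proxy-server` takes a bare host:port; `make_proxies` is a hypothetical helper introduced here for illustration.

```python
PROXY_HOST = "64.227.34.111:3128"  # the example proxy above (no auth)

def make_proxies(host):
    # requests-style proxies dict: keys are the target URL's scheme,
    # values are the proxy URL (a plain HTTP proxy here)
    return {"http": f"http://{host}", "https": f"http://{host}"}

def scrape(URL, host=PROXY_HOST):
    from requests_html import HTMLSession  # deferred import, as in the original snippet

    # get() is routed through the proxy via the proxies dict...
    # ...and render()'s Chromium is pointed at the same proxy via browser_args.
    session = HTMLSession(browser_args=[f"--proxy-server={host}"])
    resp = session.get(URL, proxies=make_proxies(host))
    resp.html.render(timeout=30)
    session.close()
    return resp.html.html
```

Whether the `browser_args` proxy actually covers all of Chromium's traffic is worth verifying in practice, per the concern below.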
So my solution will be a virtual machine with a VPN installed.
A VPN is the solution for now, but the goal is to have a proxy dict like in your example.
Well, my concern here is that the ‘render()’ method also uses internet access, so only the ‘get()’ call will use the proxy.
How would I test this?
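One way to test it (a sketch, assuming an IP-echo endpoint like httpbin.org/ip, which returns `{"origin": "<your ip>"}`): fetch the endpoint once with `get()` and again after `render()`, then compare the two reported IPs against your real one.

```python
import json

IP_ECHO = "https://httpbin.org/ip"  # assumed echo endpoint: responds with {"origin": "<ip>"}

def origin_ip(body):
    # Pull the "origin" field out of an httpbin-style JSON body
    return json.loads(body)["origin"]

def proxy_ips(proxies):
    from requests_html import HTMLSession  # deferred import, as in the snippets above

    session = HTMLSession()
    resp = session.get(IP_ECHO, proxies=proxies)
    plain_ip = origin_ip(resp.text)           # IP the get() request came from
    resp.html.render(timeout=30)              # Chromium fetches the page again
    rendered_ip = origin_ip(resp.html.text)   # IP the rendered fetch came from
    session.close()
    return plain_ip, rendered_ip
```

If `plain_ip` shows the proxy's address but `rendered_ip` shows your real one, then `render()` is bypassing the proxy.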
I don’t know if that issue has been fixed/resolved, but there seems to be a solution on requests-html’s GitHub page.
solution
Thanks, I will test it out later.