I'm just playing around, trying to grab information from websites. Unfortunately, with the following code:
import sys
import socket
import re
from urlparse import urlsplit
url = urlsplit(sys.argv[1])
sock = socket.socket()
sock.connect((url[0] + '://' + url[1],80))
path = url[2]
if not path:
    path = '/'
print path
sock.send('GET ' + path + ' HTTP/1.1\r\n'
+ 'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.3.154.9 Safari/525.19\r\n'
+ 'Accept: */*\r\n'
+ 'Accept-Language: en-US,en\r\n'
+ 'Accept-Charset: ISO-8859-1,*,utf-8\r\n'
+ 'Host: 68.33.143.182\r\n'
+ 'Connection: Keep-alive\r\n'
+ '\r\n')
I get the following error:
Traceback (most recent call last):
  File "D:\Development\Python\PyCrawler\PyCrawler.py", line 10, in <module>
    sock.connect((url[0] + '://' + url[1],80))
  File "<string>", line 1, in connect
socket.gaierror: (11001, 'getaddrinfo failed')
The only time I do not get an error is when the URL passed is http://www.reddit.com. Every other URL I have tried raises the socket.gaierror. Can anyone explain this, and possibly give a solution?
-
You forgot to resolve the hostname:

addr = socket.gethostbyname(url[1])
...
sock.connect((addr, 80))
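A minimal sketch of that fix in context (Python 2; example.com is a hypothetical host standing in for sys.argv[1]):

import socket
from urlparse import urlsplit

url = urlsplit('http://example.com/')  # hypothetical URL in place of sys.argv[1]
addr = socket.gethostbyname(url[1])    # resolve the netloc, 'example.com', to an IP string
sock = socket.socket()
sock.connect((addr, 80))               # connect() takes (host, port), never a scheme or full URL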
-
sock.connect((url[0] + '://' + url[1],80))

Do not do that; instead do this:

sock.connect((url[1], 80))

connect expects a hostname, not a URL. Actually, you should probably use something higher-level than sockets to do HTTP. Maybe httplib.
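If you only need a simple GET, a minimal httplib sketch might look like this (the host is hypothetical):

import httplib

conn = httplib.HTTPConnection('example.com')  # hypothetical host
conn.request('GET', '/')                      # httplib builds the request line and Host header
resp = conn.getresponse()
print resp.status, resp.reason                # e.g. 200 OK
body = resp.read()
conn.close()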
The.Anti.9: I've tried that too. It gives me Access Denied errors everywhere.

-
Please please please please please please please don't do this.
urllib and urllib2 are your friends.
Read the "missing" urllib2 manual if you are having trouble with it.
-
Have you ever altered your Hosts file? If it has an entry for Reddit but not much else, that might explain that site's unique result.
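One quick way to test that theory from Python (the hostnames are just examples):

import socket

# If only names listed in the Hosts file resolve, gethostbyname will
# succeed for those and raise gaierror for everything else.
for host in ('www.reddit.com', 'www.google.com'):
    try:
        print host, '->', socket.gethostbyname(host)
    except socket.gaierror, e:
        print host, '-> failed:', e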
-
Use urllib2. Or BeautifulSoup.
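A sketch of the BeautifulSoup route (BeautifulSoup 3 import style for Python 2; the URL is hypothetical):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 package name

html = urllib2.urlopen('http://example.com/').read()  # hypothetical URL
soup = BeautifulSoup(html)
for link in soup.findAll('a'):  # print every hyperlink target on the page
    print link.get('href')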