For a single-domain spider with the additional features, kept lightweight and fast to scan, I would download the HTML, JavaScript, and style sheets in full (since any of them may contain links), and request only the HTTP headers for every other file type. If the server is configured to reject HEAD requests, I would instead start reading those files normally and drop the connection as soon as the size information (the Content-Length header) has arrived. After extracting the links, I would parse them all into a database, filter out duplicate entries, and update each record with the related metadata.
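As a rough illustration of that fetch policy, here is a minimal sketch in Python using the `requests` library. The function name, the list of "parseable" content types, and the return shape are all my own choices, not from any existing spider:

```python
import requests

# Content types we download in full, since their bodies can contain links.
PARSEABLE_TYPES = ("text/html", "text/css",
                   "application/javascript", "text/javascript")

def probe(url, timeout=10):
    """Fetch a URL's body if it is parseable, otherwise just its metadata.

    Returns (content_type, size, body); body is None for non-parseable files.
    """
    # First try a HEAD request: headers only, no body is transferred.
    try:
        head = requests.head(url, timeout=timeout, allow_redirects=True)
        ctype = head.headers.get("Content-Type", "").split(";")[0].strip()
        size = head.headers.get("Content-Length")
        if ctype and not ctype.startswith(PARSEABLE_TYPES):
            return ctype, int(size) if size else None, None
    except requests.RequestException:
        pass  # HEAD refused or failed; fall back to GET below

    # Streamed GET: for parseable types we read the whole body; for
    # everything else we close the connection as soon as the response
    # headers (including Content-Length) have arrived.
    resp = requests.get(url, timeout=timeout, stream=True)
    ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
    size = resp.headers.get("Content-Length")
    size = int(size) if size else None
    if ctype.startswith(PARSEABLE_TYPES):
        return ctype, size, resp.text  # full download: may hold links
    resp.close()                       # drop the connection, body unread
    return ctype, size, None
```

For the deduplication step, one simple approach is to let the database enforce uniqueness. A sketch assuming SQLite, with a made-up schema; the upsert keeps one row per URL and refreshes its metadata on a re-crawl:

```python
import sqlite3

db = sqlite3.connect("spider.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url   TEXT PRIMARY KEY,   -- uniqueness enforced here
        ctype TEXT,
        size  INTEGER
    )
""")

def record(url, ctype, size):
    # New URLs are inserted; already-known URLs have their metadata
    # updated in place instead of creating a duplicate row.
    db.execute(
        """INSERT INTO pages (url, ctype, size) VALUES (?, ?, ?)
           ON CONFLICT(url) DO UPDATE SET ctype = excluded.ctype,
                                          size  = excluded.size""",
        (url, ctype, size),
    )
    db.commit()
```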