Disclaimer: If you don’t know what VIM, WGET, or a URL is, get away from here (to Google, of course, and then come back).
So I run into this page:
http://200.14.205.63:8080/portalicfes/home_2/htm/cont_63.jsp?rec=not_4676.jsp
which contains a small number of links to ICFES tests (kinda like the SAT), and I hit the middle mouse button, just to find out they use “javascript”. ARGH!, I think to myself.
But then, looking a little closer, I find this:
javascript:ventanaNueva('../rec/arc_4719.pdf')
This “rec” directory arouses my curiosity, so I open it
(HUGE)
http://200.14.205.63:8080/portalicfes/home_2/rec/
There must be 4000 documents or so!
OMG!
I have to download this stuff RIGHT AWAY!
So I proceed to do the following.
1. view source
2. use a VIM regular expression to find all the links
href=".*"
and I pass those to a new buffer using a VIM macro (recorded with q … q, and then replayed with @ followed by the register name).
However, pressing @@ 4000 times is kinda lame, so I just use
(normal mode)
30000@@
which works wonders.
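For anyone who wants to see the same link-collecting step outside VIM, here is a rough Python sketch. The sample HTML is made up for illustration (the real listing is not reproduced here), and find_href_attributes is just a name I picked. It uses [^"]* instead of .* so that each match stops at the first closing quote rather than running to the last quote on the line.

```python
import re

# Hypothetical stand-in for the page source; the real listing looks different.
SAMPLE_HTML = '''<a href="/portalicfes/home_2/rec/arc_1258.xls">arc_1258.xls</a>
<a href="/portalicfes/home_2/rec/arc_1257.xls">arc_1257.xls</a>'''


def find_href_attributes(html):
    """Return every href="..." attribute found in the text, in order."""
    # [^"]* matches up to the first closing quote only.
    return re.findall(r'href="[^"]*"', html)


# One match per line, like the new buffer in VIM.
print("\n".join(find_href_attributes(SAMPLE_HTML)))
```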
Now I have a new buffer with one href="…" match per line.
3. To get just the URLs, I use a regular expression like this one:
%s/.*="\(.*\)".*/\1
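If the \( \) and \1 part looks like magic, this small Python sketch does the same kind of substitution: it captures whatever sits between the quotes and replaces the whole line with just that. The function name and the sample line are mine, not from the page.

```python
import re


def strip_to_quoted_value(line):
    """Replace the whole line with the text captured between the quotes."""
    # Python writes the capture group as ( ) where VIM writes \( \).
    return re.sub(r'.*="(.*)".*', r'\1', line)


# Hypothetical sample line, shaped like the matches collected in step 2.
print(strip_to_quoted_value('href="/portalicfes/home_2/rec/arc_1258.xls"'))
```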
However, all these addresses have the form:
/a/b/c/x.pdf
but I will need the full URL, so I must prepend the server name:
%s#.*#http://200.14.205.63:8080&#
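The same step can be sketched in Python with urljoin from the standard library, which takes care of the slash between the server and the path. As a bonus it also resolves the relative ../rec/ link from the javascript call against the page address. The to_full_url helper is a name I made up, and the .xls path is assumed to look like the ones in the listing.

```python
from urllib.parse import urljoin

# The page address from the post and the server root taken from it.
PAGE = "http://200.14.205.63:8080/portalicfes/home_2/htm/cont_63.jsp?rec=not_4676.jsp"
SERVER = "http://200.14.205.63:8080/"


def to_full_url(path, base=SERVER):
    """Resolve a path from the listing (or a relative link) against a base URL."""
    return urljoin(base, path)


# A server-absolute path, as found in the directory listing (assumed form).
print(to_full_url("/portalicfes/home_2/rec/arc_1258.xls"))
# The relative link from the javascript call, resolved against the page itself.
print(to_full_url("../rec/arc_4719.pdf", base=PAGE))
```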
Now each line is a full URL, BUT I need these to be commands I can run to download the files, so I’ll just use wget (look it up):
%s/.*/wget &/
and now I have a 4000-line download program, like:
wget http://200.14.205.63:8080/portalicfes/home_2/rec/arc_1258.xls
wget http://200.14.205.63:8080/portalicfes/home_2/rec/arc_1257.xls
wget http://200.14.205.63:8080/portalicfes/home_2/rec/arc_1256.xls
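Building that list of commands is also easy to sketch in Python. The wget_commands function is a hypothetical helper, and the shlex.quote call is an addition of mine: these URLs are harmless, but a URL containing ? or & would need quoting before a shell could run it.

```python
import shlex


def wget_commands(urls):
    """Turn each URL into a wget command line, quoting it only when needed."""
    return ["wget " + shlex.quote(url) for url in urls]


urls = [
    "http://200.14.205.63:8080/portalicfes/home_2/rec/arc_1258.xls",
    "http://200.14.205.63:8080/portalicfes/home_2/rec/arc_1257.xls",
]
print("\n".join(wget_commands(urls)))
```

wget can also read a plain list of URLs from a file with its -i option, which would make the last substitution unnecessary.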
I know I could have just downloaded HTTrack, or something like that, BUT I just couldn’t resist using VIM.
There are easier ways.
All of this could have been avoided by the sysadmins if they had just put a “don’t show directory contents” server directive on this folder, or, even easier, a blank index.html file, but they were far too lazy to do that.