
Extracting Data from Web Site

Discussion in 'Tech Talk' started by N2DFire, Jan 22, 2005.

  1. N2DFire

    N2DFire Who Me ???

    Joined:
    Apr 29, 2003
    Messages:
    149
    Likes Received:
    0
    Location:
    Ferrum, VA
    Alright, I know there should be an easy way to do this, but for the life of me I can't find it.

    My g/f has a textbook that comes with a web site of additional study aids. One of these aids is a set of Flash Cards. She also has a nifty little flash card program for her PDA that will take a .txt file (in the proper format) and display it as flash cards.

    What we want to do is somehow extract the data from the web site's flash cards and put it into a text file.

    I'm comfortable enough with VB.NET text file manipulation (StreamReader) that this shouldn't be a problem; however, I can't get at the HTML files to open them (StreamReader won't accept a URL), and I can't seem to find a good program to copy the web site to my hard drive.

    The web site is set up so that there is a flashcard page containing a lot of JavaScript that makes the system work. Under that there are sub-folders for each chapter:

    /Chapter1
    .
    .
    .
    /ChapterXX

    In each chapter folder there are card files
    /card1.html
    /card2.html
    .
    .
    .
    /cardXX.html

    Every card has the following format:
    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <!--
    Each flashcard file contains term and definition for one card
    as JavaScript variables.  Page is written dynamically via
    JavaScript function.  All logic and style resides in shared files.
    State variables in parent frameset determine how/when a single card
    appears.
    -->
    <html>
    <head>
        <title>Card</title>
        <script language="JavaScript">
            //data for this card
            var term = "adenohypophysis "
            var def = "The anterior lobe of the pituitary gland."
            var audio = "none"
        </script>
        <script language="JavaScript" src="../card.js"></script>
        <LINK REL="Stylesheet" TYPE="text/css" HREF="../card.css">
    </head>
    <body bgcolor="#ffffff" background="../card.gif">
    <script language="JavaScript">
        //write the card
        writeCard()
    </script>
    </body>
    </html>
    This page is http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card1.html

    I can call each card page up on its own, but because the system was written as a frameset, the code required to make it display properly is not present.

    What I need, in a nutshell, is a way to extract the values of var term & var def from each card file so that I can then write them out into a formatted .txt file for the PDA flash card program.

    Any help with accessing the on-line HTML files via VB.NET, or a good program to cache them to my HD so I can do it the "old" way I know, would be greatly appreciated.
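    For reference, the extraction itself is small in any language with regexes. Here is a rough sketch in Python (illustrative only: the patterns assume the `var term = "..."` / `var def = "..."` lines shown in the card source above, and `fetch_and_extract` is just a made-up helper name, not anything the site provides):

    ```python
    import re
    import urllib.request

    def extract_card(html):
        """Pull the term/def JavaScript variables out of one card page.

        Assumes lines of the form:  var term = "..."  /  var def = "..."
        """
        term = re.search(r'var\s+term\s*=\s*"(.*)"', html)
        definition = re.search(r'var\s+def\s*=\s*"(.*)"', html)
        return (term.group(1) if term else None,
                definition.group(1) if definition else None)

    def fetch_and_extract(url):
        """Fetch a card page over HTTP and return its (term, definition) pair."""
        html = urllib.request.urlopen(url).read().decode("latin-1")
        return extract_card(html)
    ```

    If the live page still matches the source quoted above, `fetch_and_extract("http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card1.html")` should come back as `("adenohypophysis ", "The anterior lobe of the pituitary gland.")`.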

    Thanks in Advance

    Edited to fix URL
     
  2. Deathwind

    Deathwind

    Joined:
    Aug 13, 2002
    Messages:
    127
    Likes Received:
    0
    Location:
    In my pants
    Here's some quick-n-dirty Perl code that will do what you want:

    Code:
    #!/usr/bin/perl -w
    
    use LWP::Simple;
    
    $url = 'http://place.with.the.stuff.com/card.html';
    
    $content = get($url);
    
    if($content =~ m/.*var.term.=.\"(.*)\".*/) {
    	print("$1\n");
    }
    if($content =~ m/.*var.def.=.\"(.*)\".*/) {
    	print("$1\n");
    }
    
    Should be pretty easy to modify it to loop over the card numbers as well.

    Or, if you have an aversion to Perl (although it's great for short little text-munging jobs like this), HTTrack and wget are my preferred web site downloaders.
     

  3. N2DFire

    N2DFire Who Me ???

    Joined:
    Apr 29, 2003
    Messages:
    149
    Likes Received:
    0
    Location:
    Ferrum, VA
    Deathwind - Thanks for the reply. I don't know Perl, but it's high time I started learning it, I guess, so I'll give that a look-see.

    Also, I tried the Windows version of HTTrack and gave it "http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/flashcards.html" as the starting page and (I thought) told it to get everything below it; however, it only retrieves the seven or so files in that folder and will not recurse into the /chapterXX sub-folders.

    Any pointers on how to make HTTrack work would solve my problem. I can write the VB.NET to read the files locally; I just haven't figured out how to read them off the web.
     
  4. David_G17

    David_G17 /\/\/\/\/\/\/\/

    Joined:
    Oct 7, 2002
    Messages:
    2,046
    Likes Received:
    0
    just skimmed over the post, but would "wget" solve your problem?

    (downloading the site locally)
     
  6. N2DFire

    N2DFire Who Me ???

    Joined:
    Apr 29, 2003
    Messages:
    149
    Likes Received:
    0
    Location:
    Ferrum, VA
    YEA PERL !!!!

    I did it (well, sort of). It still has an error that I need to trap/fix, and it's not the best way of looping, but all in all, not bad for a first-timer, I don't think.

    Code:
    #!/usr/bin/perl -w
    
    use LWP::Simple;
    open(FD, ">A&P_Chapter18.txt");
    $X = 1;
    $url = 'http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card' . $X . '.html';
    $content = get($url);
    
    while($content ne '') {
    #	print $X;
    
    	if($content =~ m/.*var.term.=.\"(.*)\".*/) {
    #		print ("Q: $1\n");
    		print FD ("Q: $1\n");
    	}
    
    	if($content =~ m/.*var.def.=.\"(.*)\".*/) {
    #		print ("A: $1\n");
    		print FD ("A: $1\n");
    	}
    	print FD (" \n");
    $X += 1;
    $url = 'http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card' . $X . '.html';
    $content = get($url);
    }
    close(FD);
    
    It would have taken me forever to work out the text-matching part on my own, though. Many, many thanks to Deathwind for the starting point.
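    A footnote for anyone finding this thread later: the error N2DFire mentions is most likely Perl's "uninitialized value" warning from -w, because LWP's get() returns undef (not an empty string) once card$X.html stops existing. Here is the same loop sketched in Python, with the fetch step passed in as a function so the end-of-chapter case is handled explicitly (the URL pattern is the one from the thread; `dump_chapter` and `fetch_card` are made-up names):

    ```python
    import re
    import urllib.error
    import urllib.request

    CARD_URL = "http://media.pearsoncmg.com/bc/bc_martini_fap_6/flashcards/chapter18/card{}.html"

    # Match the JavaScript assignments shown in the card source,
    # e.g.  var term = "adenohypophysis "
    TERM_RE = re.compile(r'var\s+term\s*=\s*"(.*)"')
    DEF_RE = re.compile(r'var\s+def\s*=\s*"(.*)"')

    def fetch_card(n):
        """Return the HTML for card n, or None once we run off the end."""
        try:
            with urllib.request.urlopen(CARD_URL.format(n)) as resp:
                return resp.read().decode("latin-1")
        except urllib.error.HTTPError:
            return None  # card doesn't exist -> end of chapter

    def dump_chapter(out, fetch=fetch_card):
        """Write Q:/A: pairs to the open file `out`, one per card,
        with a near-blank line between cards (same layout as the Perl)."""
        n = 1
        html = fetch(n)
        while html is not None:
            term = TERM_RE.search(html)
            definition = DEF_RE.search(html)
            if term:
                out.write("Q: {}\n".format(term.group(1)))
            if definition:
                out.write("A: {}\n".format(definition.group(1)))
            out.write(" \n")
            n += 1
            html = fetch(n)
    ```

    Calling `dump_chapter(open("A&P_Chapter18.txt", "w"))` should produce the same Q:/A: layout the Perl version writes, but stop cleanly at the last card instead of warning.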