VoiceXML 2.1 Development Guide

Tutorial: Screen Scraping

Step 1: Some words about Screen Scraping

For some, the phrase "screen scraping" may evoke curious memories of cleaning your screened-in back porch. But in our digital world of code larceny, "screen scraping" refers to the act by which you access a website, gather up the output from that site, and use it for your own purposes. Perennial favorites include grabbing results from search engines, map-building sites, portals, and (of course) Florida State University's homepage.

But what is the purpose, you may ask, of screen scraping? Well, let's say, for example, you wanted to have a small blurb on your website about local weather conditions in Fargo, North Dakota. Do you have a direct feed to the National Weather Service? Probably not. But you are a highly skilled coder with a special place in your heart for borrowing code (your username in most chat rooms is "1337HTMLCatBurglar7"), so you can simply hit their website, parse through the data listed there, and then use what you need for your own website (which is actually a lucrative e-business venture selling cathode ray tubes to band societies across three continents).


Step 2: PHP Code 


Note: For the code below to work, you will need to have PHP installed on your host server. Again, the Voxeo webhosting drive does not support PHP, or any kind of backend scripting whatsoever.

So let's examine a hypothetical application that searches Google.com's search engine for the phrase "Why cats are better than dogs". You may be surprised at how little code it takes to actually snag the website -- in fact, it is much trickier to actually parse and remove all of the extraneous pieces of information you do not wish to end up in your VoiceXML application. So here is what it might look like:


<?php

//This is so the page does not get cached
  header('Cache-Control: no-cache');

//Opening VXML tags (the XML declaration must be the very first thing we output)
  echo '<?xml version="1.0" encoding="UTF-8" ?>
    <vxml version = "2.1">
      <meta name="maintainer" content="YOUR_EMAIL@WHEREEVER.COM"/>
      <meta name="author" content="Donald Lawson"/>

      <form>
        <block>'. "\n";

//Define our query to send to Google.com
  $SearchString = "%22why+cats+are+better+than+dogs%22";


  /***************
  Grab the URL
  ***************/

//Query google
  $query = "http://www.google.com/search?hl=en&lr=&q=". $SearchString;

//file() returns false when the fetch fails, so fall back to an empty array
  $result = @file($query);
  if($result === false) {
    $result = array();
  }
  $one_liner = "";

  if(sizeof($result) != 0) {

    //Glue all of the lines together into one long string
    foreach($result as $val) {
      $one_liner .= rtrim($val, "\n");
    }

    //Find the total number of matches
    $pattern = '/Results\D*\d+\D*\d+\D*(\d+)/';
    preg_match($pattern, $one_liner, $matches);


    echo "\t\t\t\t<prompt>There were a total of ". $matches[1] . " matches!
    Here are the first ten.<break time=\"1000\" /></prompt>\n";


    //Pull out the URL, title, and description of each result.
    //The pattern must not contain a line break, so it is built in two pieces.
    $pattern = '/<\s*p\s*class\s*=\s*g\s*>\s*<\s*a\s*href\s*=\s*"?([^">]+)"?>((.(?!<\s*br\s*\/?\s*>))*.)'
             . '((.(?!<\s*font\s*color\s*=\s*"?#008000"?\s*\/?\s*>))*.)/';
    preg_match_all($pattern, $one_liner, $matches, PREG_SET_ORDER);

    $counter = 1;
    foreach($matches as $val) {
      echo "\t\t\t\t<prompt>Match Number ". $counter ."<break time=\"1000\" />".
      fixquotes(htmlentities(strip_tags($val[2]))) .", URL: ".
      fixquotes(htmlentities(strip_tags($val[1]))) .".
      Description: ". fixquotes(htmlentities(strip_tags($val[4]))) . "</prompt>\n";
      $counter++;
    }

  } else {

    //Output an error if the URL returned nothing

    echo '<prompt>Error opening URL.  Try Again Later</prompt>';

  }

//Close out the VXML doc

  echo "\t\t\t </block>
      </form>
    </vxml>";

//This function just fixes single and double quotes so they aren't represented as HTML entities, for
//compatibility with our VXML speech machine


  function fixquotes($incoming) {
    $incoming = str_replace("&amp;#39;", "'", $incoming);
    $incoming = str_replace("&amp;quot;", "\"", $incoming);

    return $incoming;
  }
?>



Notice these key lines in our PHP document:


foreach($result as $val) {
          $one_liner .= rtrim($val, "\n");
}

This snippet of PHP removes all of the end of line characters, and creates one single string containing the data.
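
As a small self-contained illustration of that step, the sketch below runs the same loop over a hard-coded array that stands in for what file() returns. The sample lines and the helper name join_lines are made up for this example; they are not part of the tutorial's script.

```php
<?php
// Hypothetical helper: collapse an array of lines (as returned by file())
// into one string, dropping the newline at the end of each line.
function join_lines(array $lines) {
    $one_liner = "";
    foreach ($lines as $val) {
        $one_liner .= rtrim($val, "\n");
    }
    return $one_liner;
}

// Made-up data standing in for a fetched page; file() keeps the "\n" on each line.
$sample = array(
    "<p class=g>\n",
    "<a href=\"http://example.com/\">Cats</a>\n",
    "<br>\n",
);

echo join_lines($sample), "\n";
```

Because rtrim() is told to strip only "\n", a page served with Windows line endings would keep its "\r" characters; passing "\r\n" as the second argument is one way to drop both.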



if(sizeof($result) != 0) {
          foreach($result as $val) {
                    $one_liner .= rtrim($val, "\n");
          }
          // ...the rest of the parsing goes here...
}

And this humble chunk of code starts dissecting our results (assuming that the URL returned a nonempty result).


$pattern = '/Results\D*\d+\D*\d+\D*(\d+)/';
preg_match($pattern, $one_liner, $matches);

Here, we use regular expressions to find out how many matches were returned by Google.
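
One thing worth knowing about this pattern: the final (\d+) stops at the first non-digit, so a count written as "1,230" would come back as just "1". The sketch below shows one possible variant that also accepts commas and then strips them. The helper name extract_total and the sample strings are invented for illustration, and the assumption that large counts are formatted with commas is ours.

```php
<?php
// Hypothetical helper: return the total result count as a string of digits,
// or null when the "Results ..." text is not found.
function extract_total($html) {
    // Same shape as the tutorial's pattern, but the last group also accepts
    // commas so that "1,230" is not cut off at "1" (assumed number format).
    $pattern = '/Results\D*\d+\D*\d+\D*([\d,]+)/';
    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }
    return str_replace(",", "", $matches[1]);
}

// Made-up sample text, not real search output.
$sample = 'Results <b>1</b> - <b>10</b> of about <b>1,230</b> for cats';
var_dump(extract_total($sample));
var_dump(extract_total('no count here'));
```

Returning null on a failed match lets the caller decide what to say to the person on the phone instead of reading out an empty value.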



echo "\t\t\t\t<prompt>There were a total of ". $matches[1] . " matches!
Here are the first ten.<break time=\"1000\" /></prompt>\n";

Above, we output the voice prompt that tells our callers how many matches were returned from the Google query; since there are bound to be more than a few matches, we insert a <break> element so that we don't have a huge run-on sentence (like this one that you are currently reading), because otherwise, this would simply be poor authoring on our part.


$pattern = '/<\s*p\s*class\s*=\s*g\s*>\s*<\s*a\s*href\s*=\s*"?([^">]+)"?>((.(?!<\s*br\s*\/?\s*>))*.)'
         . '((.(?!<\s*font\s*color\s*=\s*"?#008000"?\s*\/?\s*>))*.)/';
preg_match_all($pattern, $one_liner, $matches, PREG_SET_ORDER);


This insanely brilliant regular expression gives us everything we want about each result in one line of code. But insane brilliance, amusing code samples, and shameless self-aggrandizement are what you have come to expect from those lovable Voxeo lunatics. Humility is cheerfully tossed out the window (as evidenced by our fifth match from Google):




$counter = 1;
    foreach($matches as $val) {

      echo "\t\t\t\t<prompt>Match Number ". $counter ."<break time=\"1000\" />". fixquotes(htmlentities(strip_tags($val[2]))) .", URL: ".
      fixquotes(htmlentities(strip_tags($val[1]))) .". Description: ". fixquotes(htmlentities(strip_tags($val[4]))) . "</prompt>\n";
      $counter++;
}

Lastly, we come to the metaphorical Finish Line, where we output all of the results to our VoiceXML document.
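
To see the result pattern and the output loop working together without touching the network, here is a rough sketch that runs them against a hand-written fragment. The sample markup and the helper name build_prompts are invented for this example; the fragment only imitates the shape the pattern expects and is not real search output.

```php
<?php
// Same helper as in the tutorial: undo the double-encoded quote entities.
function fixquotes($incoming) {
    $incoming = str_replace("&amp;#39;", "'", $incoming);
    $incoming = str_replace("&amp;quot;", "\"", $incoming);
    return $incoming;
}

// Hypothetical helper: turn one-line HTML into an array of <prompt> strings.
function build_prompts($one_liner) {
    $pattern = '/<\s*p\s*class\s*=\s*g\s*>\s*<\s*a\s*href\s*=\s*"?([^">]+)"?>'
             . '((.(?!<\s*br\s*\/?\s*>))*.)'
             . '((.(?!<\s*font\s*color\s*=\s*"?#008000"?\s*\/?\s*>))*.)/';
    preg_match_all($pattern, $one_liner, $matches, PREG_SET_ORDER);

    $prompts = array();
    $counter = 1;
    foreach ($matches as $val) {
        // $val[1] = URL, $val[2] = title markup, $val[4] = description markup
        $prompts[] = "<prompt>Match Number " . $counter . "<break time=\"1000\" />"
            . fixquotes(htmlentities(strip_tags($val[2]))) . ", URL: "
            . fixquotes(htmlentities(strip_tags($val[1]))) . ". Description: "
            . fixquotes(htmlentities(strip_tags($val[4]))) . "</prompt>";
        $counter++;
    }
    return $prompts;
}

// Made-up markup in the shape the pattern expects.
$sample = '<p class=g><a href="http://example.com/cats">Why cats don&#39;t care</a>'
        . '<br>Cats are quiet.<font color=#008000>example.com</font>';

foreach (build_prompts($sample) as $prompt) {
    echo $prompt, "\n";
}
```

A side note on the escaping: htmlentities() can emit named entities such as &eacute; that plain XML does not define, so for text with accented characters htmlspecialchars() may be the safer choice. That is a suggestion on our part, not something the tutorial's script does.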


For illustrative purposes, here are our two links in actual working order:

Google's normal search results:





VoiceXML version search results:






We can see that we have chopped away everything but the basic information. It is worth noting that screen scraping assumes you know the format of the output. Were Google to change their display style, it would make this application (and this tutorial) non-functional because we are looking for specific string sequences to delineate data. This is the weakness of screen scraping as a technique.
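
One modest way to soften that weakness is to check whether the pattern matched at all and play a fallback prompt when it did not, rather than reading out an empty value. The sketch below assumes a helper name (count_prompt) and fallback wording of our own; neither comes from the tutorial's script.

```php
<?php
// Hypothetical helper: build the "total matches" prompt, or a fallback prompt
// when the page no longer contains the text the pattern looks for.
function count_prompt($one_liner) {
    $pattern = '/Results\D*\d+\D*\d+\D*(\d+)/';
    if (preg_match($pattern, $one_liner, $matches) === 1) {
        return "<prompt>There were a total of " . $matches[1] . " matches!</prompt>";
    }
    // The layout changed, or the fetch returned something unexpected.
    return "<prompt>Sorry, the search results could not be read.</prompt>";
}

// Made-up inputs: one in the expected format, one from a "redesigned" page.
echo count_prompt('Results 1 - 10 of about 42 for cats'), "\n";
echo count_prompt('<html>A redesigned page</html>'), "\n";
```

This does not make the scrape any less fragile, but the caller hears a sensible message while you go fix the pattern.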


Step 5: Upload, and try it out.

Your "Hello World" with Screen Scraping is now done. Call the number associated with your VoiceXML application, and you'll find that the results of this tutorial work just like the previous one. Except now, we have at our disposal the entire Internet to plunder for content over the telephone.

  ANNOTATIONS: EXISTING POSTS
bfoster63
10/25/2004 2:07 PM (EDT)
Has anyone recently worked with the PHP version and had it work?  Does it need tweaking?  It does not work for me, but I am not a PHP expert by any means.  Thanks.  Bob
MattHenry
10/25/2004 6:16 PM (EDT)
Hi Bob,


Apparently, Excite has drastically changed their search page, making this tutorial somewhat invalid. Sorry about that; it's difficult to maintain this tutorial when those folks keep improving their content.

8^P

However, I will make an effort in the next week to redo this tutorial. I think instead of a true screen scrape, it might be best to show off how the <data> element can pull such info off of a Google XML feed.

Sorry for any confusion,

~Matt
bfoster63
2/1/2005 2:09 PM (EST)
Matt and Donald,

Very nice job!  It works beautifully now.  Thank you.  Any other PHP examples used with VXML?

Bob bfoster63
Michael.Book
2/1/2005 6:53 PM (EST)
Howdy "bfoster63,"

We do indeed have a couple of PHP / VoiceXML applications for you at:  http://community.voxeo.com/vxml/apps/home.jsp


Have Fun,

~ Michael Book
steven_shatz
1/9/2006 11:37 AM (EST)
Has anyone gotten this version to work recently?  The php file() command does not seem to work for search related queries.  I was wondering if this might be a result of the new Google Web API and the new need for an account to pull searches.

Thanks Steve
blitzie
1/13/2006 10:13 PM (EST)
bfoster63 and MattHenry

I'm a student trying to learn voicexml for my project. I'm just wondering how you're able to test your PHP/VoiceXML applications.

I mean, where do you upload your PHP/VoiceXML files?

I'm not even sure if VoiceXML will work with PHP and MySQL (it's because I'll be using a database for my application too) ...

Thanks a lot ^_^
kevinlim
1/13/2006 11:59 PM (EST)
Howdy blitzie,

Thanks for your question. 

Feel free to host static VoiceXML (*.vxml) pages right here with us.  Check out the "Files, Logs, and Reports" section of your account page.  You'll have to make sure that your application source is within the "www" subdirectory, so our Voice servers can access it.

However, we do not offer hosting with application server functionality, for JSP, PHP, CFM, etc.  It is up to you to find a proper host for that.  Anywhere they reside, the Voxeo servers can grab them by URL (provided they are accessible by URL).

You can certainly use a back-end PHP/MySQL integration (voice apps wouldn't be very interesting if they couldn't be dynamic, or access databases). Simply use PHP to grab/update fields in your DB, and dynamically generate proper VoiceXML for the voice browsers to render.

Check out our tutorial on how to pass variables to-and-fro between the VoiceXML context and the server-side context:

http://docs.voxeo.com/voicexml/2.0/frame.jsp?page=qs_vars.htm

^____^


Best,
Kevin
Voxeo Extreme Support
atomical
1/30/2006 3:57 PM (EST)
A PHP script with a two-line REGEXP is simple? :P

One thing you might want to keep in mind with an extensive php/mysql/voicexml application is using sessions.  Voxeo will always send a session variable of their own to your php application.  You should set this as your session variable using session_id(); 
Michael.Book
1/30/2006 5:54 PM (EST)
Hey atomical,

Good point about using PHP (or any other server-side language for that matter) session vars. 

Although, just to make sure there is no confusion out there...  While atomical's suggestion of using the XML's sessionID for your server-side SID may indeed help many of you feel all clean and tidy, it is, of course, not a requirement.  Nor is it required that you explicitly pass the SID in your GET or POST requests.  Our browsers - CallXML, VoiceXML, and CCXML - do support non-persistent cookies just fine and dandy, which I believe is even the default method that PHP uses to maintain session data...  :-)

Anyway, thanks for the post.  Everybody have fun...


~ Michael
kvenki4u
4/24/2010 4:57 PM (EDT)
Hey guys,

Can anyone please tell me how to run this PHP code to get the VoiceXML code from the script?

thank you.
VoxeoDante
4/25/2010 12:05 AM (EDT)
Hello,

You should be able to copy the code here into a .php page, and then host it on a PHP server.  Once you have done that, all you should need to do is hit it with your application or a web browser, and it should return VoiceXML.

If that is not happening for you, please let us know what results you are getting, and we can get things working for you.

Regards,
Dante Vitulano
Customer Support Engineer II
Voxeo Corporation

© 2013 Voxeo Corporation  |  Voxeo IVR  |  VoiceXML & CCXML IVR Developer Site