Fun in the Terminal With Lynx
The GNU/Linux command line gives you a lot of small tools that can be connected with each other by piping the output of one tool into another tool.
For example, you might see a page with a lot of links on it that you want to examine more closely. You could open up a terminal and type something like the following:
$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt
That will give you a list of outgoing links on the web page at http://www.example.com, nicely printed to a file called file.txt in your current directory.
Here’s how it works:
Lynx is a text-only web browser, which makes it great for extracting text from web pages. The -dump option tells Lynx to fetch the web page, print the rendered text to standard output (here, the terminal), and exit. It is followed by the URL you want to visit. So lynx -dump "http://www.example.com" is just saying, "Lynx, dump the output of http://www.example.com to the screen".
You can try the first part by itself to see what it does, replacing http://www.example.com with another URL of your choice. In the following example I’ve used the BBC News home page:
$ lynx -dump "http://news.bbc.co.uk"
Notice in the image snippet below, which shows the tail end of Lynx’s output, that Lynx gives a list of URLs, preceded by numbers. We are going to extract only those URLs from the output in the next step:
Extracting the Links from Lynx
Now we can look at the next part of the URL extraction process:
$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt
When you use a pipe (the | symbol), it tells the shell to take the output of the first tool and send it as input to the tool that follows. So we are taking the output of Lynx and sending it to grep.
Grep is a tool that searches text and prints each line that contains a match for a pattern. The -o option tells grep to return only the matching part of the line, not the entire line. We are searching for anything that matches "http:.*", which is a simple regular expression.
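To see what -o changes, here is a minimal sketch that runs grep with and without it. The input line is made up to resemble one entry in Lynx's numbered list; it is not real output, and the variable names are only for illustration.

```shell
# Made-up line in the style of Lynx's numbered link list (not real output).
line='   1. http://www.example.com/about'

# Without -o, grep prints the whole matching line, number included.
whole=$(printf '%s\n' "$line" | grep "http:.*")

# With -o, grep prints only the part of the line that matched.
only=$(printf '%s\n' "$line" | grep -o "http:.*")

printf 'without -o: [%s]\n' "$whole"
printf 'with -o:    [%s]\n' "$only"
```

The brackets in the printed output are there only to make the leading spaces and number visible in the first result.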
A regular expression is a pattern made up of symbols that tell the computer what to look for in order to make a match. We want to find anything that matches the pattern: http: [and anything that comes after that]. A period (.) in a regular expression matches any single character. The asterisk (*) means zero or more of the preceding item. So "http:.*" means "match 'http:' and any number of characters that follow it, up to the end of the line". Since each URL in Lynx’s list is the last thing on its line, this extracts the URLs. Note that the pattern matches only URLs beginning with http:, so https: links are skipped.
We could stop there and run just this, which sends the output to the screen:
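The following sketch tries the pattern on a few made-up lines (not real Lynx output) to show how far .* reaches and which lines it skips. The -E variant with "https?:.*" is an assumption about what you might want, not part of the original command; -E switches grep to extended regular expressions, where ? makes the preceding s optional.

```shell
# Made-up sample lines (not real Lynx output).
sample='   1. http://www.example.com/about
   2. https://www.example.com/secure
See http://www.example.com/help for details'

# The original pattern: the literal text "http:" followed by the rest of the line.
plain=$(printf '%s\n' "$sample" | grep -o "http:.*")

# Variant (an assumption, not in the original): also accept "https:".
both=$(printf '%s\n' "$sample" | grep -oE "https?:.*")

printf 'http: only:\n%s\n\n' "$plain"
printf 'http: or https:\n%s\n' "$both"
```

The first result has two lines, and the second of them still carries the trailing words "for details", because .* keeps matching to the end of the line. The https line shows up only in the second result.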
$ lynx -dump "http://www.example.com" | grep -o "http:.*"
But it would be nice to save the output for later. To save the output to a file, add the > symbol followed by a file name. In this case the output is redirected to a file named file.txt, as shown below.
$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt
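One thing to keep in mind is that > replaces the contents of the target file every time the command runs. The sketch below shows this with a temporary file standing in for file.txt; the >> operator, which appends instead, is an addition that the original command does not use.

```shell
# Temporary file standing in for file.txt, so nothing real is overwritten.
outfile=$(mktemp)

# > creates the file, or empties it first if it already exists.
printf 'http://www.example.com/first\n' > "$outfile"
printf 'http://www.example.com/second\n' > "$outfile"
after_overwrite=$(cat "$outfile")

# >> appends to the end of the file instead of replacing it.
printf 'http://www.example.com/third\n' >> "$outfile"
after_append=$(cat "$outfile")

printf 'after >:\n%s\n\n' "$after_overwrite"
printf 'after >>:\n%s\n' "$after_append"

rm -f "$outfile"
```

If you want to keep the results of an earlier run, pick a different file name or use >> instead.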
Other Options
Here is an example of some other tools that you can add to the pipeline. The sort command sorts the results, and uniq removes duplicate entries. Because uniq only collapses duplicates that sit on adjacent lines, it comes after sort.
$ lynx -dump "http://www.example.com" | grep -o "http:.*" | sort | uniq >file.txt
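The order of sort and uniq matters, because uniq only drops a line when it repeats the line directly above it. Here is a small sketch using made-up links rather than real Lynx output; sort -u is a shorthand that is not part of the original command.

```shell
# Made-up links with one duplicate that is not on adjacent lines.
links='http://www.example.com/b
http://www.example.com/a
http://www.example.com/b'

# uniq alone misses the duplicate because the two copies are not adjacent.
uniq_only=$(printf '%s\n' "$links" | uniq)

# Sorting first puts the copies next to each other, so uniq can drop one.
sorted_uniq=$(printf '%s\n' "$links" | sort | uniq)

# sort -u sorts and removes duplicates in one step.
sort_u=$(printf '%s\n' "$links" | sort -u)

printf 'uniq only:\n%s\n\n' "$uniq_only"
printf 'sort | uniq:\n%s\n\n' "$sorted_uniq"
printf 'sort -u:\n%s\n' "$sort_u"
```

The first result still has three lines, while the other two each contain the two distinct links.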