Downloading stock prices in F# - Part II - Html scraping
Luca Bolognese
Other parts:
- Part I - Data modeling
- Part III - Async loader for prices and divs
- Part IV - Async loader for splits
- Part V - Adjusting historical data
- Part VI - Code posted
Getting stock prices and dividends is relatively easy given that, on Yahoo, you can get the info as a CSV file. Getting the splits info is harder. You would think that Yahoo would put that info in the dividends CSV as it does when it displays it on screen, but it doesn’t. So I had to write code to scrape it from the multiple web pages where it might reside. In essence, I’m scraping this.
html.fs
In this file there are utility functions that I will use later on to retrieve split info.
#light
open System
open System.IO
open System.Text.RegularExpressions

// It assumes no table inside table ...
let tableExpr = "<table[^>]*>(.*?)</table>"
let headerExpr = "<th[^>]*>(.*?)</th>"
let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase
This code is straightforward enough (if you know what Regex does). I’m sure that there are better expressions to scrape tables and rows on the web, but these work in my case. I really don’t need to scrape tables. I put the table expression there in case you need it.
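To make the role of the options concrete, here is a minimal sketch of what the cell expression matches. The sample string is made up for illustration (it is not taken from the Yahoo pages); it has upper-case tags, an attribute, and a cell that spans two lines.

```fsharp
open System.Text.RegularExpressions

// Same pattern and options as in html.fs
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Hypothetical input, invented for this example
let sample = "<TD align=\"right\">1.5</TD><td>3 : 2\nStock Split</td>"

// Collect the text captured by the first group of every match
let cells =
    [ for m in Regex.Matches(sample, colExpr, regexOptions) -> m.Groups.[1].Value ]
// cells = ["1.5"; "3 : 2\nStock Split"]
```

IgnoreCase lets the pattern match the upper-case tags, Singleline lets the dot match the newline inside the second cell, and the lazy quantifier stops each match at the first closing tag instead of swallowing both cells.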
I then write code to scrape all the cells in a piece of html:
let scrapHtmlCells html =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.Item(1).ToString() }
This is a sequence expression. Sequence expressions are used to generate sequences starting from some expression (as the name hints). In this case Regex.Matches returns a MatchCollection, which is a non-generic IEnumerable. For each element in it, we return the value of the first capture group, that is, the text between the opening and closing tags. We could as easily have constructed a list or an array, given that there is not much deferred computation going on. But oh well.
Always check the type of your functions in F#! With type inference it is easy to get it wrong. Hovering your mouse on top of the function in VS shows it. scrapHtmlCells is typed: string -> seq<string>
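One way to have the compiler do the checking for you is to write the expected type down. The sketch below is my annotated variant of the function, not the code from html.fs, and scrapHtmlCellsList is a hypothetical name for the eager list version mentioned above.

```fsharp
open System.Text.RegularExpressions

let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

// Annotated variant: the compiler rejects the body if it is not string -> seq<string>
let scrapHtmlCells (html: string) : seq<string> =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value }

// Hypothetical eager variant, typed string -> string list
let scrapHtmlCellsList (html: string) : string list =
    [ for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value ]

// Both produce the same cells for a small made-up input
let lazyCells = scrapHtmlCells "<td>a</td><td>b</td>" |> Seq.toList
let eagerCells = scrapHtmlCellsList "<td>a</td><td>b</td>"
// lazyCells = ["a"; "b"] and eagerCells = ["a"; "b"]
```

The only difference is when the work happens: the seq version runs the loop each time the sequence is enumerated, while the list version runs it once when the function is called.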
We’ll need rows as well.
let scrapHtmlRows html =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }
This works about the same. I’m matching all the rows and retrieving the cells for each one of them. I’m getting back a matrix-like structure, that is to say, this function has type: string -> seq<seq<string>>
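Here is a small, self-contained sketch of the two functions working together. The table markup and the dates in it are invented for the example and are not real split data; the type annotations are my addition.

```fsharp
open System.Text.RegularExpressions

let rowExpr = "<tr[^>]*>(.*?)</tr>"
let colExpr = "<td[^>]*>(.*?)</td>"
let regexOptions = RegexOptions.Multiline ||| RegexOptions.Singleline ||| RegexOptions.IgnoreCase

let scrapHtmlCells (html: string) =
    seq { for x in Regex.Matches(html, colExpr, regexOptions) -> x.Groups.[1].Value }

let scrapHtmlRows (html: string) =
    seq { for x in Regex.Matches(html, rowExpr, regexOptions) -> scrapHtmlCells x.Value }

// Hypothetical markup: one header row and two data rows
let sampleTable =
    "<table><tr><th>Date</th><th>Event</th></tr>" +
    "<tr><td>2001-01-02</td><td>2 : 1 Stock Split</td></tr>" +
    "<tr><td>2002-03-04</td><td>3 : 2 Stock Split</td></tr></table>"

// Force the nested sequences into lists so they are easy to inspect
let rows = scrapHtmlRows sampleTable |> Seq.map Seq.toList |> Seq.toList
// rows = [[]; ["2001-01-02"; "2 : 1 Stock Split"]; ["2002-03-04"; "3 : 2 Stock Split"]]
```

Note that the header row comes back as an empty list, because it contains th cells and colExpr only matches td cells. A caller that wants only data rows would have to skip the empty ones.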
That’s all for today. In the next installment we’ll make it happen.