An Async Html cache – Part I - Writing the cache

In the process of converting a financial VBA Excel Addin to .NET (more on that in later posts), I found myself in dire need of an HTML cache that can be called from multiple threads without blocking them. Visualize it as a glorified dictionary where each entry is (url, cachedHtml). The only difference is that when you get the page, you pass a callback to be invoked when the HTML has been loaded (which could be immediately if the HTML had already been retrieved by someone else).

In essence, I want this:

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))

I’m not a big expert in the .NET Parallel Extensions, but I’ve got help. Stephen Toub helped so much with this that he could have blogged about it himself. And, by the way, this code runs on Visual Studio 2010, which we haven’t shipped yet. I believe that, with some modifications, it can run on 2008 + .NET Parallel Extensions CTP, but you’ll have to change a bunch of names.

In any case, here it comes. First, let’s add some im­ports.

Imports System.Collections.Concurrent
Imports System.Threading.Tasks
Imports System.Threading
Imports System.Net

Then, let’s de­fine an asyn­chro­nous cache.

Public Class AsyncCache(Of TKey, TValue)

This thing needs to store the (url, html) pairs somewhere and, luckily enough, there is a handy ConcurrentDictionary that I can use. Also, the cache needs to know how to load a TValue given a TKey. In “programmingese”, that means:

    Private _loader As Func(Of TKey, TValue)
    Private _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

I’ll need a way to cre­ate it.

    Public Sub New(ByVal l As Func(Of TKey, TValue))
        _loader = l
    End Sub

Notice in the above code the use of Task(Of TValue), instead of TValue, as the value type of my dictionary. Task is a very good abstraction for “do some work asynchronously and call me when you are done”. It’s easy to initialize and it’s easy to attach callbacks to it. Indeed, this is what we’ll do next:

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))
        Dim task As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, task) Then
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent)
            If _map.TryAdd(key, task) Then
                task.Start()
            Else
                task.Cancel()
                _map.TryGetValue(key, task)
            End If
        End If
        task.ContinueWith(Sub(t) callback(t.Result))
    End Sub

Wow. Ok, let me explain. This method is divided into two parts. The first part is just a thread-safe way to say “give me the task corresponding to this key or, if the task hasn’t been inserted in the cache yet, create it and insert it”. The second part just says “add callback to the list of functions to be called when the task has finished running”.
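To make the second part concrete, here is a minimal sketch, separate from the cache itself, of attaching two callbacks to a single task. The module name TaskCallbackSketch and the 6 * 7 workload are made up for illustration; the point is only that every continuation receives the same finished task.

```vbnet
Imports System
Imports System.Threading.Tasks

' Hypothetical demo module; not part of the cache in this post.
Module TaskCallbackSketch
    Sub Main()
        ' Describe the work without starting it yet.
        Dim work As New Task(Of Integer)(Function() 6 * 7)

        ' Attach two independent callbacks to the same task.
        Dim first = work.ContinueWith(Sub(t) Console.WriteLine("first callback: {0}", t.Result))
        Dim second = work.ContinueWith(Sub(t) Console.WriteLine("second callback: {0}", t.Result))

        work.Start()

        ' Wait for both callbacks so the console app does not exit early.
        ' The order in which the two lines are printed is not guaranteed.
        Task.WaitAll(first, second)
    End Sub
End Module
```

A callback attached after the task has already completed still runs, which is why a later caller asking for an already-cached page gets its callback invoked as well.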

The first part needs some more ex­pla­na­tion. What is TaskCreationOptions.DetachedFromParent? It es­sen­tially says that the cre­ated task is not go­ing to pre­vent the par­ent task from ter­mi­nat­ing. In essence, the task that cre­ated the child task won’t wait for its con­clu­sion. The rest is bet­ter ex­plained in com­ments.

        If Not _map.TryGetValue(key, task) Then ' Is the task in the cache? (Loc. X)
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent) ' No, create it
            If _map.TryAdd(key, task) Then ' Try to add it
                task.Start() ' I succeeded. I’m the one who added this task. I can safely start it.
            Else
                task.Cancel() ' I failed, someone inserted the task after I checked at (Loc. X). Cancel it.
                _map.TryGetValue(key, task) ' And get the one that someone inserted
            End If
        End If

Got it? Well, I ad­mit I trust Stephen that this is what I should do …
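The code above targets the pre-release bits. As far as I know, neither TaskCreationOptions.DetachedFromParent nor Task.Cancel made it into the released .NET Framework 4 API: tasks are detached unless you ask for AttachedToParent, and cancellation goes through CancellationToken. Under that assumption, here is a sketch of the same get-or-insert logic using ConcurrentDictionary.GetOrAdd. The class name AsyncCacheSketch is hypothetical, and this is a variation on the idea rather than the code Stephen suggested.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

' Hypothetical variant of AsyncCache; a sketch, not the code from this post.
Public Class AsyncCacheSketch(Of TKey, TValue)

    Private ReadOnly _loader As Func(Of TKey, TValue)
    Private ReadOnly _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

    Public Sub New(ByVal loader As Func(Of TKey, TValue))
        _loader = loader
    End Sub

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))
        Dim entry As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, entry) Then
            ' Create the task but do not start it yet (detached is the default).
            Dim candidate As New Task(Of TValue)(Function() _loader(key))
            ' GetOrAdd returns whichever task ended up stored for this key.
            entry = _map.GetOrAdd(key, candidate)
            If entry Is candidate Then
                candidate.Start() ' We won the race, so we start the stored task.
            End If
            ' If we lost the race, the never-started candidate is simply dropped.
        End If
        ' As in the original, a failed load surfaces when t.Result is read.
        entry.ContinueWith(Sub(t) callback(t.Result))
    End Sub
End Class
```

The candidate task is only started by the thread whose task actually landed in the dictionary, so the loader still runs at most once per key.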

I can then create my little HTML cache by using the AsyncCache class above, as in:

Public Class HtmlCache

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))
        _asyncCache.GetValueAsync(url, callback)
    End Sub
    Private Function LoadWebPage(ByVal url As String) As String
        Using client As New WebClient()
            'Test.PrintThread("Downloading on thread {0} ...")
            Return client.DownloadString(url)
        End Using
    End Function
    Private _asyncCache As New AsyncCache(Of String, String)(AddressOf LoadWebPage)
End Class

I have no idea why col­or­ing got dis­abled when I copy/​paste. It does­n’t mat­ter, this is triv­ial. I just cre­ate an AsyncCache and ini­tial­ize it with a method that knows how to load a web page. I then sim­ply im­ple­ment GetHtmlAsync by del­e­gat­ing to the un­der­ly­ing GetValueAsync on AsyncCache.

It is somewhat bizarre to call WebClient.DownloadString when the design could be revised to take advantage of its asynchronous version. Maybe I’ll do it in another post. Next time, I’ll write code to use this thing.
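For the curious, one possible shape for that revision is sketched below. It wraps WebClient.DownloadStringAsync and its DownloadStringCompleted event in a TaskCompletionSource, so that no thread sits blocked while the page downloads. The module and function names (AsyncDownloadSketch, DownloadStringTask) are hypothetical, and this is an untested-in-production sketch rather than the design I will necessarily end up with.

```vbnet
Imports System
Imports System.Net
Imports System.Threading.Tasks

' Hypothetical helper module; names are illustrative only.
Public Module AsyncDownloadSketch

    ' Wraps WebClient's event-based asynchronous download in a Task(Of String).
    Public Function DownloadStringTask(ByVal url As String) As Task(Of String)
        Dim tcs As New TaskCompletionSource(Of String)()
        Dim client As New WebClient()

        AddHandler client.DownloadStringCompleted,
            Sub(sender As Object, e As DownloadStringCompletedEventArgs)
                ' Translate the event outcome into the task's final state.
                If e.Error IsNot Nothing Then
                    tcs.TrySetException(e.Error)
                ElseIf e.Cancelled Then
                    tcs.TrySetCanceled()
                Else
                    tcs.TrySetResult(e.Result)
                End If
                client.Dispose()
            End Sub

        ' Returns immediately; completion is reported through the event above.
        client.DownloadStringAsync(New Uri(url))
        Return tcs.Task
    End Function
End Module
```

To plug something like this into the cache, the loader would have to change from Func(Of TKey, TValue) to a function returning Task(Of TValue), and the dictionary would store the returned task directly instead of creating and starting one.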

Comments

It would be much eas­ier to use a nor­mal thread safe col­lec­tion class.
Each el­e­ment would have:
 key
 url (string)
 status (loaded, failed, wait­ing to load, par­tially loaded)
 last sta­tus change (date/time)
 html_loaded (string)
 last_referenced (date/time)
 can_timeout_and_be_deleted(boolean)
Class meth­ods
  Get HTML from URL(boolean lookup_only = false, int max_block­_sec­onds = 0 /* -1 block for­ever, 0 - don’t block, oth­er­wise block for X sec­onds*/)
  Get HTML from KEY(boolean lookup_only = false)
  Delete_entry(URL)
  Delete_entry(KEY)
A thread or threads internal to the class would load the html asynchronously and be invoked via a clock timer with ticks a few seconds apart.
Attaching a callback for each request is much harder to implement.  It is up to the method requesting the URL to decide whether or not it blocks, needs an asynchronous callback/interrupt or polls for data.
The idea is that for nearly all cases, no new threads should be cre­ated and no new call­backs should be hooked up.  This keeps your code eas­ier to un­der­stand and de­bug.  Common faults and sce­nar­ios are han­dled eas­ily:
 - re­quest­ing thread ter­mi­nates
 - asyn­chro­nous load times out
 - er­ror load­ing html
 - html has­n’t been used for 5 min­utes and can be re­moved (a tun­able cache pa­ra­me­ter)
 - mem­ory limit of cache reached and un­ref­er­enced html strings can be re­moved (a tun­able cache pa­ra­me­ter)
 - du­pli­cate re­quest for a URL/KEY from more than one thread
 - html can be loaded from mul­ti­ple sources (web, file, net­work share, ftp, data­base, etc.).
 - html load failed as html string ex­ceeds the size limit on loaded string (e.g., a tun­able cache pa­ra­me­ter)
 - The com­mon prob­lem with at­tempt­ing a call­back for a method that is ter­mi­nated is avoided.  That’s a prob­lem when the call­back re­quires the cache to build a com­plex packet of data to pass in the call­back.
This is quite similar to the basic page handling algorithm in a virtual memory system (circa 1980).  It’s how one handled this in systems lacking real threading or with non-reentrant GUI message handling (VB6 GUI/MFC GUI posting a message to the current winform indicating asynchronous request completed).

Thanks Greg, these are good com­ments.
We have a dif­fer­ent de­sign goal though. Both so­lu­tions are valid. I want the method re­quest­ing the URL to have the flex­i­bil­ity of de­cid­ing what to do (aka have a call­back). I do want the ex­posed API to be async.
The rest of your com­ments talk to the dif­fer­ence be­tween writ­ing pro­duc­tion code and a con­cep­tual ex­am­ple. I’m do­ing the lat­ter here.

The idea of wrapping the asynchronous cache handler in a class is to reduce or eliminate the need for callers to be asynchronous.  This makes coding the caller’s class much easier.
The other aspect is that the amount of work done in an asynchronous callback should be minimal since you don’t know when it will be executed.  For example, you get a callback call with the HTML you need whilst you are destroying the caller’s object.  This is more important when dealing with large amounts of data in each cache entry (e.g., large xml strings) since processing each cache entry may take considerable time.

The Visual Basic Team

2009-04-29T15:15:28Z

You may know Luca Bolognese from his well-known work on C# LINQ. Luca is now the Group Program Manager

Luca Bolognese's WebLog

2009-05-08T11:53:22Z

Other posts: Part I — Writing the cache Let’s try out our lit­tle cache. First I want to write a syn­chro­nous