As a kid, I was fascinated with secret codes. After seeing the movie, "The Miracle Worker" (every kid in the 60s and 70s saw that movie at school, I'm sure), my friends and I learned how to use the American Sign Language finger spelling (visit http://where.com/scott.net/asl/abc.html to learn more) and spent months sending secret messages across the classroom while the teacher wasn't looking. For a while, my friends in religious school transliterated English words using the Hebrew alphabet as our own secret code. Yes, it's true?we were truly the geeks of the 60s. It's a tough job, too, because the Hebrew alphabet doesn't quite match up to the English alphabet, but nothing deterred the spies in training.
Maybe this is why hashing algorithms have always intrigued me. Back in the 80s, when I was teaching Data Structures using Pascal at Harvard Extension?teaching adults is way easier than teaching teenagers?I loved it when we got to the hashing section. I even love the name (yummmm... "hash"; you'll just have to imagine that Homer Simpson voice I hear in my head). In case you've missed the fun, hashing takes a piece of data and converts it to some other form, normally an array of bytes or simply an integer, in such a way that the process is deterministic. That is, if you hash a given value multiple times, you always get the same output. On the other hand, hashing also provides no return route. Because two values might result in the same hash value, there's no way to determine the original data given its hashed value.
In my .NET experience, developers use hashing for two very different purposes. First, the concept of hash searching comes into play when you have a number of objects you'd like to store somewhere in memory, and later be able to retrieve one of the objects given a particular piece of information about the object (somewhat like a primary key value, when you store data into a database table). This concept plays out in the use of the HashTable data structure, and the GetHashCode method supplied by the base Object type (and therefore, each and every object in the .NET Framework). Developers also use hashing in order to store encrypted values for comparison, such as passwords. For example, you might store a hashed password value in a database table, and when asked to validate a user, compare the hashed password against the hashed text the user has entered. If the values match, you know you've got the correct password. (This explains why many Web sites are unable to send you your current password; the best they can do is reset your password for you. Because they're only storing a hashed version of your password, they have no mechanism for sending you the original value. They can't retrieve it.)
The base Object class provides the GetHashCode method and most classes override its behavior with some value that's meaningful for that class' data. The .NET documentation is pretty clear on how this method works. It indicates that for any class, the GetHashCode method must have the following characteristics (and I quote):
- If two objects of the same type represent the same value, the hash function must return the same constant value for either object.
- For the best performance, a hash function must generate a random distribution for all input.
- The hash function must return exactly the same value regardless of any changes that are made to the object.
The GetHashCode method provided by the String class, for example, returns an integer that represents some manipulation of the text stored within the string. The following code shows off the feature:
Dim x As String = "Hello"
' Displays 222703111
Debug.WriteLine(x.GetHashCode)
x = "Goodbye"
' Displays -1207123720
Debug.WriteLine(x.GetHashCode)
x = "Hello"
' Displays 222703111
Debug.WriteLine(x.GetHashCode)
The GetHashCode method for a more complex type might base the hash value off of one or more fields or properties of an instance, or might provide a hash based on the address of the object.
Developers use hash tables for all sorts of purposes. For example, I recently got a call from a friend with this problem: He needed to be able to parse a text file, he needed to keep track of all the words in the file, and he needed to know how many times each word appeared. (This isn't a new problem. This sort of text concordance is something every developer in school has written.) Hash tables are perfect for this sort of operation.
You can think of a hash table as an array, containing buckets in which you store data. The data structure uses the hash value (that is, the integer 222703111 for the string "Hello") modulo the size of the hash table to determine the position in the table that the value will be stored. If you're writing the hash table algorithm yourself, you would need to worry about how the calculations are done, and you'd also need to consider what happens when two values "hash" to the same integer value and cause a collision?and that's sure to happen sooner or later?but when you use the System.Collections.HashTable data structure, you needn't worry about such details; the .NET Framework handles all these issues. You add a value to the hash table using the Add method, specifying the key to be hashed along with the data to be stored. You can call the Contains method to determine if a specific key has already been added. Because the HashTable class implements the IEnumerable interface, you can iterate through the collection. Each item within the hash table is a DictionaryEntry type, and you can retrieve the Key or Value property of the DictionaryEntry to get back the data you stored.
If you want to try this yourself, first create a class that contains a string and an integer that will contain the text and the number of times each has appeared. Override the GetHashCode and ToString methods so that your class looks like this (I've used fields rather than properties, and added the ToString override, to keep things simple for this demonstration):
Public Class TextData
Public Value As String
Public Count As Integer
Public Sub New(ByVal Value As String)
Me.Value = Value
Me.Count = 1
End Sub
Public Overrides Function GetHashCode() _
As Integer
Return Value.GetHashCode()
End Function
Public Overrides Function ToString() As String
Return Me.Value & ":" & Me.Count
End Function
End Class
Add a method like the following to your project. The ParseFile procedure allows you to specify a file name, and returns the completed HashTable object:
Private Function ParseFile( _
ByVal fileName As String) _
As Hashtable
' Given a file name, parse the words out into
' a hash table and return it.
Dim data As TextData
Dim table As New Hashtable
Dim reader As New StreamReader(fileName)
Dim text As String = reader.ReadToEnd()
' Do some very cheap parsing.
' This could really benefit
' from regular expressions!
Dim words() As String = _
text.Split(" ,.()[]{}".ToCharArray)
For Each word As String In words
If table.Contains(word) Then
' If the value is already in the
' hash table, just increment its Count
' property.
data = DirectCast(table(word), TextData)
data.Count += 1
Else
' The word isn't yet in the hash table.
' Create the TextData instance,
data = New TextData(word)
table.Add(word, data)
End If
Next
Return table
End Function
ParseFile does its work by first declaring the variables it will need, and then loads the text from the specified text file:
Dim data As TextData
Dim table As New Hashtable
Dim reader As New StreamReader(fileName)
Dim text As String = reader.ReadToEnd()
The code performs some very cheap (that is, neither elegant nor accurate) text parsing, using the Split method, and then loops through all the words in the array it creates:
Dim words() As String = _
text.Split(" ,.()[]{}".ToCharArray)
For Each word As String In words
' Code removed here...
Next
Within the loop, the code checks each word to see if it's already in the hash table, and if so, retrieves the corresponding TextValue object and increments its Count property:
If table.Contains(word) Then
' If the value is already in the
' hash table, just increment its Count
' property.
data = DirectCast(table(word), TextData)
data.Count += 1
Else
' Code removed here...
End If
If the word wasn't already in the HashTable, the code creates a new TextData object and adds it:
If table.Contains(word) Then
' Code removed here...
Else
' The word isn't yet in the hash table.
' Create the TextData instance,
data = New TextData(word)
table.Add(word, data)
End If
The procedure completes by returning the filled-in HashTable.
To test the ParseFile function, you could write code that passes it a file name and then loops through all the DictionaryEntry values in the returned HashTable, like this:
Dim table As Hashtable = _
ParseFile("C:\sessions.txt")
For Each de As DictionaryEntry In table
Debug.WriteLine(de.Value.ToString())
Next
If you wanted to retrieve just a specific item from the HashTable structure, you can index into the data using the key value, like this:
If table.Contains("Windows") Then
Debug.WriteLine( _
CType(table("Windows"), TextData).Count)
End If
The Item method (the default method in Visual Basic, and the indexer in C#) returns the value associated with the supplied key, as an Object type. It's up to you to cast it as the type you need, and then you can retrieve any property of the object.
As you can see, using hashing and the HashTable class provided by the .NET Framework makes it simple to store values that you might later need to retrieve. Although we haven't talked much about storing secrets, using hash searching is a valid use for hashing. Next time, we'll investigate how you can use hashing to allow you to hide data, as well. Stay tuned!