# VBForums UtilityBank > UtilityBank - Tutorials > The Basic Principles of Lexical Analysis

## dday9

*Prerequisites*
You must have a solid understanding of the following:
- The principles of your desired programming language.
- Common programming terms, e.g. iterate, conditional statement, etc.
- If you are not using Visual Basic .NET, how the code I present to you translates into your language.

*Terminology*
I will be using some terms throughout this tutorial that are not used in everyday coding so I would like to define how I will be using them:
_Lexeme_  The most basic unit in a given source code e.g. in the source code: 

```
Dim foo As Integer = 1
```

Each word, symbol, and number represents a lexeme.

_Token_ - A key/value pair where the key is typically a rule name and the value is typically a lexeme, e.g. in the source code:

```
Dim foo As Integer = 1
```

The equal sign would likely be a token where the key is _assignment_ and the value is =.
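
Applied to the whole line, a scanner might emit token pairs along these lines (the category names here are illustrative, not part of any standard):

```
keyword    : Dim
identifier : foo
keyword    : As
type       : Integer
assignment : =
number     : 1
```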

*Preface*
Programming Language: Visual Basic .NET using the .NET framework 4.5
Target Language: JSON
Why JSON:
JSONs syntax make it one of the easiest languages to scan/parse.The syntax will likely stay contant for a very long time.

*Lexical Analysis - Basics*
The process of lexical analysis, also known as scanning, is to read the source code and break up the lexemes into meaningful tokens.

The first step in the lexical analysis phase is to create a Token class. This class can be as simple as two String properties: _Category_ and _Value_.

The _Category_ property will represent the group that a lexeme belongs to.

The _Value_ property will represent the lexeme value found.

*Token*


```
Public Class Token

    Private _category As String
    Public Property Category As String
        Get
            Return _category
        End Get
        Set(ByVal value As String)
            If Not String.Equals(_category, value) Then
                _category = value
                OnCategoryChanged()
            End If
        End Set
    End Property

    Private _value As String
    Public Property Value As String
        Get
            Return _value
        End Get
        Set(ByVal value As String)
            If Not String.Equals(_value, value) Then
                _value = value
                OnValueChanged()
            End If
        End Set
    End Property

    Protected Overridable Sub OnCategoryChanged()
        RaiseEvent CategoryChanged(Me, EventArgs.Empty)
    End Sub

    Protected Overridable Sub OnValueChanged()
        RaiseEvent ValueChanged(Me, EventArgs.Empty)
    End Sub

    Public Event CategoryChanged(ByVal sender As Object, ByVal e As EventArgs)
    Public Event ValueChanged(ByVal sender As Object, ByVal e As EventArgs)

    Public Sub New()
        _category = String.Empty
        _value = String.Empty
    End Sub

    Public Sub New(ByVal category As String, ByVal value As String)
        _category = category
        _value = value
    End Sub

End Class
```
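
As a quick sanity check, the class might be used like this (a sketch; the CategoryChanged/ValueChanged events are there for callers that want change notifications and can be ignored otherwise):

```
Dim t As New Token("number", "1")
Console.WriteLine(t.Category & " : " & t.Value) ' prints "number : 1"
```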

*Lexical Analysis - Categories*
Now that there is a structure in place to hold the tokens, we need to create functions that will match lexemes by category.

In JSON there are 5 different categories:
- String
- Number
- Boolean
- Symbol
- Whitespace

The following image is a railroad diagram of a String as defined by the JSON language:


This means that we must match any Unicode character except the quotation mark and the reverse solidus, or any escape sequence. An escape sequence is one of the following characters preceded by a reverse solidus:
"\/bfnrtu

The simplest way to match a JSON String is to use regular expressions. The regular expression pattern that I use is: 

```
"([^"\\]|\\["\\\/bfnrt])*"
```

The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

Inside the function we will check whether the RegEx matches and whether the match is at the beginning of the String. If it is, return a Token with the category _string_ and the value equal to everything between the opening and closing quotes.

Finally we will shorten the source parameter by the length of the matched String.


```
' Requires: Imports System.Text.RegularExpressions
Private Function Scan_String(ByRef source As String) As Token
    Dim m As Match = Regex.Match(source, """([^""\\]|\\[""\\\/bfnrt])*""")
    Dim t As Token = Nothing

    If m.Success AndAlso m.Index = 0 Then
        Dim literal As String = m.Value.Remove(0, 1)
        literal = literal.Remove(literal.Length - 1)
        t = New Token("string", literal)
        source = source.Substring(m.Value.Length)
    End If

    Return t
End Function
```
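
For illustration, here is how a call might behave on a small input (a sketch assuming the function above is in scope):

```
Dim source As String = """hello"": 1"
Dim t As Token = Scan_String(source)
' t.Category = "string", t.Value = "hello"
' source has been shortened to ": 1"
```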

The following image is a railroad diagram of a Number as defined by the JSON language:


This means that we can have an optional negative sign, followed by one or more digits. We can also optionally have a decimal point followed by one or more digits, and optionally the letter e or E followed by an optional + or - sign and one or more digits.

The simplest way to match a JSON Number is to use regular expressions. The regular expression pattern that I use is: 

```
-?\d+((\.\d+)?([eE][+-]?\d+)?)
```

The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

Inside the function we will check whether the RegEx matches and whether the match is at the beginning of the String. If it is, return a Token with the category _number_ and the value equal to the match.

Finally we will shorten the source parameter by the length of the matched String.


```
Private Function Scan_Number(ByRef source As String) As Token
    Dim m As Match = Regex.Match(source, "-?\d+((\.\d+)?([eE][+-]?\d+)?)")
    Dim t As Token = Nothing

    If m.Success AndAlso m.Index = 0 Then
        t = New Token(number", m.Value)
        source = source.Substring(m.Value.Length)
    End If

    Return t
End Function
```
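
A quick example showing that the pattern picks up the optional sign, fraction, and exponent in a single match (a sketch using the function above):

```
Dim source As String = "-12.5e3, true"
Dim t As Token = Scan_Number(source)
' t.Category = "number", t.Value = "-12.5e3"
' source has been shortened to ", true"
```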

The following image is a railroad diagram of a Boolean as defined by the JSON language:

This means that we need to match the literal values:
- true
- false
- null

While you could match JSON Boolean values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.

The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

Inside the function we will check whether the source code begins with any of the literals. If it does, return a Token with the category _boolean_ and the value equal to the match.

Finally we will shorten the source parameter by the length of the matched String.


```
Private Function Scan_Boolean(ByRef source As String) As Token
    Dim t As Token = Nothing

    If source.Length >= 4 AndAlso (source.Substring(0, 4).Equals("true", StringComparison.OrdinalIgnoreCase) OrElse source.Substring(0, 4).Equals("null", StringComparison.OrdinalIgnoreCase)) Then
        t = New Token("boolean", source.Substring(0, 4))
        source = source.Substring(4)
    ElseIf source.Length >= 5 AndAlso source.Substring(0, 5).Equals("false", StringComparison.OrdinalIgnoreCase) Then
        t = New Token("boolean", source.Substring(0, 5))
        source = source.Substring(5)
    End If
    End If

    Return t
End Function
```

The following image is a railroad diagram of a Symbol as defined by the JSON language:


This means that we need to match the literal values:
[]{}:,

While you could match JSON symbol values using regular expressions, it is much more efficient to match the literals using built-in comparison methods.

The function used will have a single ByRef String parameter to represent the source code that we have left to scan.

Inside the function we will check whether the first character of the source code appears in an array containing the symbol literals (i.e. its index in the array is not -1). If it does, return a Token with the category _symbol_ and the value equal to the match.

Finally we will shorten the source parameter by the length of the matched String.


```
Private Function Scan_Symbol(ByRef source As String) As Token
    Dim t As Token = Nothing
    Dim symbols() As Char = "[]{}:,".ToCharArray()

    If source.Length > 0 AndAlso Array.IndexOf(symbols, source(0)) <> -1 Then
        t = New Token(symbol", source(0))
        source = source.Remove(0, 1)
    End If

    Return t
End Function
```

Whitespace is ignored in JSON, so we will check whether the first character of the source code is whitespace and, if so, remove it without returning anything. However, you will still want the return type of the function to be a Token, and I will expand on why a little later.


```
Private Function Scan_Whitespace(ByRef source As String) As Token
    If source.Length > 0 AndAlso Char.IsWhiteSpace(source(0)) Then
        source = source.Remove(0, 1)
    End If

    Return Nothing
End Function
```

*Lexical Analysis - Scanning*
Now that all the category definitions are implemented as functions, the next step is to loop through the source code until its length is 0.

This will be done by creating a function that returns a fixed size collection of Tokens that has a single ByVal String parameter that represents the source code.

Inside the function, use a Do/Until loop to loop until the length of the source code is 0. The following code is the shell for our function:


```
Private Function Scan(ByVal source As String) As Token()
    Do

    Loop Until source.Length = 0
End Function
```

Inside the Do/Until loop we will want to check if the remaining source code matches any of the definitions we just created and the reason why I suggested making the return type of the _whitespace_ definition a Token is because in Visual Basic .NET we can create a Delegate Function that groups all the definitions together. The following code should be placed outside of the function:


```
Private Delegate Function Scan_Definition(ByRef source As String) As Token
```

Now outside of the Do/Until loop in the Scan function, declare:
- A dynamically sized collection variable to hold the scanned tokens
- A fixed-size collection variable to hold all the definitions
- A placeholder token that will be reused within the Do/Until loop

The following code is still not the final product of the Scan function but should be reflective of the code used so far:


```
Private Function Scan(ByVal source As String) As Token()
    Dim lexedTokens As New List(Of Token)
    Dim definitions() As Scan_Definition = {
        New Scan_Definition(AddressOf Scan_String),
        New Scan_Definition(AddressOf Scan_Number),
        New Scan_Definition(AddressOf Scan_Boolean),
        New Scan_Definition(AddressOf Scan_Symbol),
        New Scan_Definition(AddressOf Scan_Whitespace)
    }
    Dim t As Token = Nothing

    Do

    Loop Until source.Length = 0
End Function
```

Now inside of the Do/Until loop, loop through each definition and assign the token placeholder equal to the result of the function. If the token is not a null result then add it to the collection to be returned.

Finally return the collection as a fixed sized collection. The following code is the final draft of the Scan function:


```
Private Function Scan(ByVal source As String) As Token()
    Dim lexedTokens As New List(Of Token)
    Dim definitions() As Scan_Definition = {
        New Scan_Definition(AddressOf Scan_String),
        New Scan_Definition(AddressOf Scan_Number),
        New Scan_Definition(AddressOf Scan_Boolean),
        New Scan_Definition(AddressOf Scan_Symbol),
        New Scan_Definition(AddressOf Scan_Whitespace)
    }
    Dim t As Token = Nothing

    Do
        For Each definition As Scan_Definition In definitions
            t = definition.Invoke(source)
            If t IsNot Nothing Then
                lexedTokens.Add(t)
            End If
        Next
    Loop Until source.Length = 0

    Return lexedTokens.ToArray()
End Function
```
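
Putting it all together, a small driver might look like this. Note that the input must be valid JSON: a character that matches no definition would never be consumed, so the loop would not terminate (a sketch using the functions above):

```
Dim tokens() As Token = Scan("{""key"": [1, true]}")
For Each t As Token In tokens
    Console.WriteLine(t.Category & " : " & t.Value)
Next
' symbol : {
' string : key
' symbol : :
' symbol : [
' number : 1
' symbol : ,
' boolean : true
' symbol : ]
' symbol : }
```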

*Notes*
As you can tell, the process of lexical analysis is fairly simple, and the process I have shown is universal: you could do the same thing for just about any other language that you need to scan.

One thing to point out is that there is no prioritization in JSON. In some languages there is an _identifier_ category which can match several different categories if all else fails. In scenarios such as these, you would stop iterating through the definitions as soon as one matches and let the Do/Until loop start the process over.
