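C Lexer: Design and Implementation

Topic

Design and implementation of a lexical analyzer for the C language

Lab Content and Requirements

  1. Recognize every word symbol in a source program written in C and output each one as a token.

  2. Recognize and skip comments in the source program.

  3. Count the number of statement lines, the number of words in each category, and the total number of characters in the source program, and output the statistics.

  4. Detect lexical errors in the source program and report where each error occurs.

  5. Recover appropriately from errors so that lexical analysis can continue; a single scan of the source program should detect and report all of its lexical errors.

Implementation Requirements

Implement the analyzer in both of the following ways.

Method 1: write the lexical analyzer by hand, using C/C++ as the implementation language. (Required)

Method 2: write a LEX source program and use the LEX compiler to generate the lexical analyzer automatically.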

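Program Design Notes

Introduction to Lexical Analysis

Lexical analysis, also called scanning, reads the input character stream and groups it into meaningful sequences called lexemes. For each lexeme, the lexical analyzer outputs a tuple of the following form, called a token:

$$ \left\langle \text{token-name}, \text{attribute-value} \right\rangle $$

The token name is an abstract symbol used by the next phase, syntax analysis. The attribute value points to an entry in the symbol table.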
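As a rough illustration of the pair above, a token can be modeled as a small record. The `token` struct, the reduced `token_name` enum, and the helper below are hypothetical names chosen for this sketch; they are not taken from the program's source. For simplicity the attribute value here is the lexeme text and its position, where a full compiler would typically store a symbol-table reference.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, reduced set of token names used only in this sketch.
enum class token_name { ID, NUM, EQUAL, SEMIC };

// A token: the abstract name consumed by the parser plus an attribute value.
struct token
{
  token_name name;     // abstract symbol for syntax analysis
  std::string lexeme;  // attribute value: the matched source text
  std::size_t line;    // 1-based line of the lexeme
  std::size_t col;     // 1-based column where the lexeme starts
};

// Build the token for the lexeme `x` in the one-line input `int x = 1;`.
inline token
make_example_token ()
{
  return token{ token_name::ID, "x", 1, 5 };
}
```

Keeping the position inside the token is what later allows error messages to point at an exact line and column.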

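Features of This Program

  • Reports line and column numbers, and handles CJK (Chinese, Japanese, Korean) characters.

    image-20211006153159486

  • Reads source files of arbitrary size, because the file is read through an input file stream rather than loaded into memory all at once. The figure below shows the btree implementation from sqlite, about ten thousand lines:

    image-20211006153758019

  • Reads, analyzes, and outputs at the same time. This is implemented with callback functions. Internally the scanner uses std::istream and does not depend on file streams, so in principle it can analyze code read from a network stream.

  • Recovers from errors and can locate multiple lexical errors in one pass. Below is a sample error report from the program:

    image-20211006154218684

  • Locates preprocessor directives, comments, and similar content.

  • Handles floating-point numbers, exponents, hexadecimal, and other number forms.

    image-20211006154531431

  • Handles escape characters, line continuations, and other special cases.

    It can tell whether an escape sequence is valid:

    image-20211006154416190

  • Friendly error messages (see the figure above).

  • Multiple log levels, configurable through an environment variable.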

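Design Approach

The basic approach to lexing C code is to read the file stream character by character and, depending on the kind of character, enter a different matching branch.

Token Types

First, define the token types. The keyword list follows the C keywords page on cppreference.com: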

include/token_type.h

namespace lb_lexer
{
enum token_type
{
  PAR_L, // (
  PAR_R, // )
  BRA_L, // [
  BRA_R, // ]
  CUR_L, // {
  CUR_R, // }
  COMMA, // ,
  DOT,   // .
  PLUS,  // +
  SEMIC, // ;
  SLASH, // /
  STAR,  // *
  COLON, // :
  PERCT, // %
  QUEST, // ?

  INLCOM, // inline comment: // ...
  BLKCOM, // block comment: /* ... */

  PREPROC, // preprocessor directive

  MINUS,  // -
  PTMEM,  // -> member access through pointer
  BANG,   // !
  BANGE,  // !=
  EQUAL,  // =
  EQUALE, // ==
  GREAT,  // >
  GREATE, // >=
  LESS,   // <
  LESSE,  // <=
  CARET,  // ^
  CARETE, // ^=
  TILDE,  // ~
  TILDEE, // ~=
  AND,    // &
  AAND,   // &&
  OR,     // |
  OOR,    // ||
  PLUSE,  // +=
  MINUSE, // -=
  MULTPE, // *=
  DIVE,   // /=

  // Identifiers and literals
  ID,  // identifier
  CHR, // char 'x'
  STR, // string
  NUM, // number

  // Keywords.
  AUTO,
  BREAK,
  CASE,
  CHAR,
  CONST,
  CONTINUE,
  DEFAULT,
  DO,
  DOUBLE,
  ELSE,
  ENUM,
  EXTERN,

  FLOAT,
  FOR,
  GOTO,
  IF,
  INLINE,
  INT,
  LONG,
  REGISTER,
  RESTRICT,
  RETURN,
  SHORT,

  SIGNED,
  SIZEOF,
  STATIC,
  STRUCT,
  SWITCH,
  TYPEDEF,
  UNION,
  UNSIGNED,
  VOID,
  VOLATILE,
  WHILE,

  _ALIGNAS,
  _ALIGNOF,
  _ATOMIC,
  _BOOL,
  _COMPLEX,
  _DECIMAL128,
  _DECIMAL32,
  _DECIMAL64,
  _GENERIC,
  _IMAGINARY,
  _NORETURN,
  _STATIC_ASSERT,
  _THREAD_LOCAL,

  END_OF_FILE,
  LINE_BREAK,
  UNKNOWN
};
}
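Since keywords and identifiers share the same lexical shape, one common way to tell them apart is to scan a whole word first and then look it up in a keyword table. The sketch below shows that idea with a reduced enum; `kw_type` and `classify_word` are hypothetical names for illustration, and the report does not say whether the program uses a hash table for this step.

```cpp
#include <string>
#include <unordered_map>

// Reduced stand-in for lb_lexer::token_type, for illustration only.
enum class kw_type { ID, IF, INT, RETURN, WHILE };

// Classify a scanned word: keywords get their own type, anything else is ID.
inline kw_type
classify_word (const std::string &word)
{
  static const std::unordered_map<std::string, kw_type> keywords = {
    { "if", kw_type::IF },
    { "int", kw_type::INT },
    { "return", kw_type::RETURN },
    { "while", kw_type::WHILE },
  };
  auto it = keywords.find (word);
  return it == keywords.end () ? kw_type::ID : it->second;
}
```

Looking up the complete word avoids misclassifying identifiers that merely start with a keyword, such as `integer`.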

Error Types

The following error types are defined:

include/scanner.h

enum lexical_etype
{
  UNEXPECTED_CHAR,
  UNTERMINATED_STRING,
  UNTERMINATED_CHAR,
  EMPTY_CHAR_LITERAL,
  UNTERMINATED_BLOCK_COMMENT,
  INVALID_ESCAPE_CHAR
};
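Each error type needs a readable message when it is reported. The sketch below pairs the enumerators with the message texts that appear in the test logs later in this report and formats them in the same `(file:line:col,pos)` style. The functions `describe_error` and `format_error` are hypothetical, the pairing of enumerators to messages is inferred from the logs, and the wording for `UNTERMINATED_CHAR` is an assumption because that error does not occur in the logs.

```cpp
#include <cstddef>
#include <string>

// Same enumerators as lexical_etype in include/scanner.h.
enum lexical_etype
{
  UNEXPECTED_CHAR,
  UNTERMINATED_STRING,
  UNTERMINATED_CHAR,
  EMPTY_CHAR_LITERAL,
  UNTERMINATED_BLOCK_COMMENT,
  INVALID_ESCAPE_CHAR
};

// Map an error type to a short message (pairing inferred from the logs).
inline const char *
describe_error (lexical_etype type)
{
  switch (type)
    {
    case UNEXPECTED_CHAR:
      return "Unexpected char";
    case UNTERMINATED_STRING:
      return "Missing closing quote";
    case UNTERMINATED_CHAR:
      return "Missing closing single quote"; // assumed wording, not in the logs
    case EMPTY_CHAR_LITERAL:
      return "Expect char literal, nothing given";
    case UNTERMINATED_BLOCK_COMMENT:
      return "Unterminated block comment";
    case INVALID_ESCAPE_CHAR:
      return "Invalid escape char";
    }
  return "Unknown lexical error";
}

// Produce "Lexical error: <message> (<file>:<line>:<col>,<pos>)".
inline std::string
format_error (lexical_etype type, const std::string &file, std::size_t line,
              std::size_t col, std::size_t pos)
{
  return std::string ("Lexical error: ") + describe_error (type) + " ("
         + file + ":" + std::to_string (line) + ":" + std::to_string (col)
         + "," + std::to_string (pos) + ")";
}
```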

Lexical Analysis

The core is a single switch statement:

src/scanner.cpp

void
scanner::scan_token ()
{
  auto c = advance ();

  switch (c)
    {
    case EOF:
      spdlog::info ("EOF");
      finished = true;
      break;
    case '#':
      match_preproc ();
      break;
    case '(':
      fast_yield (token_type::PAR_L);
      break;
    case ')':
      fast_yield (token_type::PAR_R);
      break;
    case '{':
      fast_yield (token_type::CUR_L);
      break;
    case '}':
      fast_yield (token_type::CUR_R);
      break;
    case '[':
      fast_yield (token_type::BRA_L);
      break;
    case ']':
      fast_yield (token_type::BRA_R);
      break;
    case ',':
      fast_yield (token_type::COMMA);
      break;
    case '.':
      fast_yield (token_type::DOT);
      break;
    case '+':
      fast_yield (token_type::PLUS);
      break;
    case ':':
      fast_yield (token_type::COLON);
      break;
    case ';':
      fast_yield (token_type::SEMIC);
      break;
    case '*':
      fast_yield (token_type::STAR);
      break;
    case '%':
      fast_yield (token_type::PERCT);
      break;
    case '?':
      fast_yield (token_type::QUEST);
      break;
    case '-':
      fast_yield (match ('>') ? token_type::PTMEM : token_type::MINUS);
      break;
    case '&':
      fast_yield (match ('&') ? token_type::AAND : token_type::AND);
      break;
    case '|':
      fast_yield (match ('|') ? token_type::OOR : token_type::OR);
      break;
    case '!':
      fast_yield (match ('=') ? token_type::BANGE : token_type::BANG);
      break;
    case '=':
      fast_yield (match ('=') ? token_type::EQUALE : token_type::EQUAL);
      break;
    case '<':
      fast_yield (match ('=') ? token_type::LESSE : token_type::LESS);
      break;
    case '>':
      fast_yield (match ('=') ? token_type::GREATE : token_type::GREAT);
      break;
    case '~':
      fast_yield (match ('=') ? token_type::TILDEE : token_type::TILDE);
      break;
    case '^':
      fast_yield (match ('=') ? token_type::CARETE : token_type::CARET);
      break;
    case '/':
      if (match ('/'))
        {
          match_inline_comment ();
        }
      else if (match ('*'))
        {
          match_block_comment ();
        }
      else
        {
          fast_yield (token_type::SLASH);
        }
      break;
    case ' ':
    case '\r':
    case '\t':
      // Ignore whitespace.
      break;
    case '\n':
      col = 0;
      line++;
      break;
    case '\'':
      match_char ();
      break;
    case '"':
      match_string ();
      break;
    default:
      if (is_digit (c))
        match_number ();
      else if (is_word (c))
        match_identifier ();
      else
        {
          if (on_error (lexical_etype::UNEXPECTED_CHAR))
            {
              // try recover
            }
          else
            {
              throw std::exception ();
            }
        }

      break;
    }
}
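The switch relies on a few small primitives: `advance` consumes a character and `match` consumes the next one only when it equals the expected character. Their implementation is not shown in this report, so the following is a minimal sketch of how such helpers might look on top of `std::istream`. The class `char_reader` and the function `first_operator` are hypothetical, and the real scanner keeps an internal buffer with `start` and `current` indices instead of peeking at the stream directly.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper wrapping the one-character lookahead used by the switch.
class char_reader
{
public:
  explicit char_reader (std::istream &in) : in_ (in) {}

  // Consume and return the next character, or EOF at end of input.
  int advance () { return in_.get (); }

  // Look at the next character without consuming it.
  int peek () { return in_.peek (); }

  // Consume the next character only if it equals `expected`.
  bool match (char expected)
  {
    if (in_.peek () != std::char_traits<char>::to_int_type (expected))
      return false;
    in_.get ();
    return true;
  }

private:
  std::istream &in_;
};

// Name the operator token at the start of `source`, using the same
// "one character, then optional second character" pattern as the switch.
inline std::string
first_operator (const std::string &source)
{
  std::istringstream in (source);
  char_reader reader (in);
  int c = reader.advance ();
  if (c == '!')
    return reader.match ('=') ? "BANGE" : "BANG";
  if (c == '-')
    return reader.match ('>') ? "PTMEM" : "MINUS";
  return "OTHER";
}
```

A single character of lookahead is enough for every two-character operator in the enum, which is why the switch never needs to back up.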

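More complex tokens are handled by subroutines:

include/scanner.h

  // Match an identifier
  void match_identifier ();
  // Match a number
  void match_number ();
  // Match a character literal
  void match_char ();
  // Match a string literal
  void match_string ();
  // Match an inline comment
  void match_inline_comment ();
  // Match a block comment
  void match_block_comment ();
  // Match a preprocessor directive
  void match_preproc ();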
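The bodies of these subroutines are not listed in this report. As one example, number matching has to cover the hexadecimal, fractional, and exponent forms exercised in the test report. The sketch below is a hypothetical, string-based version (`number_length` is not a function from the program) that returns how many leading characters form a numeric literal. It assumes integer suffixes such as `u` or `L` and octal validation are out of scope.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

inline bool is_dec (char c) { return std::isdigit (static_cast<unsigned char> (c)) != 0; }
inline bool is_hex (char c) { return std::isxdigit (static_cast<unsigned char> (c)) != 0; }

// Return how many leading characters of `s` form a numeric literal
// (0 if `s` does not start with a digit).
inline std::size_t
number_length (const std::string &s)
{
  std::size_t i = 0;
  const std::size_t n = s.size ();
  if (n == 0 || !is_dec (s[0]))
    return 0;

  // Hexadecimal: 0x / 0X followed by at least one hex digit.
  if (s[0] == '0' && n > 2 && (s[1] == 'x' || s[1] == 'X') && is_hex (s[2]))
    {
      i = 2;
      while (i < n && is_hex (s[i]))
        ++i;
      return i;
    }

  // Integer part.
  while (i < n && is_dec (s[i]))
    ++i;

  // Optional fraction: '.' followed by at least one digit.
  if (i + 1 < n && s[i] == '.' && is_dec (s[i + 1]))
    {
      ++i;
      while (i < n && is_dec (s[i]))
        ++i;
    }

  // Optional exponent: e/E, optional sign, at least one digit.
  if (i < n && (s[i] == 'e' || s[i] == 'E'))
    {
      std::size_t j = i + 1;
      if (j < n && (s[j] == '+' || s[j] == '-'))
        ++j;
      if (j < n && is_dec (s[j]))
        {
          while (j < n && is_dec (s[j]))
            ++j;
          i = j; // accept the exponent only when it has digits
        }
    }
  return i;
}
```

The exponent is committed only after at least one digit has been seen, so an input like `1e;` yields the literal `1` and leaves `e` for the identifier matcher.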

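Position tracking uses the following variables:

include/scanner.h

  // Whether the source has been read completely
  bool finished;
  // Index of the start of the current lexeme in the buffer
  size_t start;
  // Index of the current position in the buffer
  size_t current;
  // Current line
  size_t line;
  // Current column in bytes (a CJK character counts as three, matching its UTF-8 byte length)
  size_t col;
  // Current column in characters (a CJK character counts as one)
  size_t wcol;
  // Number of bytes read so far, i.e. the current offset
  size_t pos;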
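The difference between `col` and `wcol` can be illustrated with a short sketch, assuming the input is UTF-8. Every byte advances the byte column, while only bytes that start a code point advance the character column. The `columns` struct and `measure_columns` function are hypothetical names for this illustration.

```cpp
#include <cstddef>
#include <string>

// Byte column and character column after reading a whole line.
struct columns
{
  std::size_t col;   // counts bytes
  std::size_t wcol;  // counts code points
};

// Measure both columns for a UTF-8 encoded line.
inline columns
measure_columns (const std::string &utf8_line)
{
  columns c{ 0, 0 };
  for (unsigned char byte : utf8_line)
    {
      ++c.col;
      // UTF-8 continuation bytes have the bit pattern 10xxxxxx;
      // any other byte starts a new code point.
      if ((byte & 0xC0) != 0x80)
        ++c.wcol;
    }
  return c;
}
```

For a line holding two CJK characters of three bytes each, this gives `col == 6` and `wcol == 2`. The byte column stays consistent with file offsets, while the character column is the natural one to use when placing a marker under the offending character in an error message.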

Escape sequences are handled by the following routine:

src/scanner.cpp


void
scanner::handle_escape ()
{
  if (peek () == '\\')
    {
      switch (peek_next ())
        {
        case '\n': // line continuation
          new_line ();
          advance ();
          break;
          /* octal */
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
          advance ();
          break;
        // see https://en.wikipedia.org/wiki/Escape_sequences_in_C
        case 'a':  // Alert (bell, alarm)
        case 'b':  // Backspace
        case 'e':  // Escape character
        case 'f':  // Form feed (new page)
        case 'n':  // New-line
        case 'r':  // Carriage return
        case 't':  // Horizontal tab
        case 'v':  // Vertical tab
        case '\'': // Single quotation mark
        case '\"': // Double quotation mark
        case '?':  // Question mark
        case '\\': // Backslash
        case 'u':  // Unicode code point below 10000 hexadecimal (added in
                   // C99)
        case 'U':  // Unicode code point where h is a hexadecimal digit
          advance ();
          break;
        default: /* Escaped character like \ ^ : = */
          if (!on_error (lexical_etype::INVALID_ESCAPE_CHAR))
            {
              throw std::exception ();
            }
          advance ();
          break;
        }
    }
}
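To see the effect of this rule on a whole string body, the sketch below counts the invalid escapes in a string using the same set of accepted characters as the switch above. `is_valid_escape` and `count_invalid_escapes` are hypothetical helpers, not part of the program. Note that `x` is not in the accepted set, so this sketch would also flag `\x` hexadecimal escapes.

```cpp
#include <cstddef>
#include <string>

// Characters accepted after a backslash, copied from the switch in
// handle_escape: octal digits, the listed letters and symbols, and newline.
inline bool
is_valid_escape (char c)
{
  static const std::string valid = "01234567abefnrtv'\"?\\uU\n";
  return valid.find (c) != std::string::npos;
}

// Count backslash sequences in `body` whose following character is not valid.
inline std::size_t
count_invalid_escapes (const std::string &body)
{
  std::size_t invalid = 0;
  for (std::size_t i = 0; i + 1 < body.size (); ++i)
    {
      if (body[i] != '\\')
        continue;
      if (!is_valid_escape (body[i + 1]))
        ++invalid;
      ++i; // skip the escaped character so "\\\\" counts as one escape
    }
  return invalid;
}
```

Counting instead of stopping at the first problem mirrors the recovery strategy of the scanner: every invalid escape is reported and scanning continues with the next character.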

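Source Code

See the attachment. The program is also open source: pluveto/lb_lexer (github.com)

How to Use the Source Code

Setting Up the Development Environment

Operating system: Debian 11 is used as the example; other distributions are similar.

For Windows, see here

Install system dependencies
$ sudo apt install cmake
Install the vcpkg package manager
$ mkdir ~/app
$ cd ~/app
$ git clone https://github.com/microsoft/vcpkg
$ ./vcpkg/bootstrap-vcpkg.sh
$ sudo ln -s "$(pwd)/vcpkg/vcpkg" /usr/bin/vcpkg

Install dependencies

vcpkg install spdlog
Configure the VSCode environment

VSCode needs the C/C++, CMake, and CMake Tools extensions. (Installing the complete C/C++ Extension Pack is recommended.)

Append the following to .vscode/settings.json: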

{
    "cmake.buildArgs": [  "-DCMAKE_TOOLCHAIN_FILE:FILEPATH=$VCPKG_PATH/scripts/buildsystems/vcpkg.cmake"],
    "cmake.configureArgs": [  "-DCMAKE_TOOLCHAIN_FILE:FILEPATH=$VCPKG_PATH/scripts/buildsystems/vcpkg.cmake"],
    "C_Cpp.default.includePath": [
        "$VCPKG_PATH/installed/x64-linux/include",
    ],
    "C_Cpp.default.configurationProvider": "ms-vscode.cmake-tools",
    "cmake.configureOnOpen": true,
}

Replace $VCPKG_PATH with the absolute path of your vcpkg installation directory, e.g. /home/pluveto/app/vcpkg.

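Building and Running

Run CMake Configure

Open the VSCode command palette (F1) and run CMake: Configure.

Run the build script

mr is a ready-made script that builds and runs the program. Run it with ./mr.

If you get a permission error, run $ chmod +x mr

The program will then be built and run.

Executable

The program can be built with CMake. A binary for Linux is included in the attachment.

Test Report

String, Chinese String, Hexadecimal, Decimal, and Exponent Tests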

dist/sample.c

#include <stdio.h>
#include <stdlib.h>
int
main ()
{
  // Master spark!
  /**
   * lalalalal
   *  ffffffffffffff
   */
  const char * str = "Hello, /*This is a test*/\r\n\t\"\"";
  const char * str_zh = "你好,这是一个测试程序";
  // 这是一句中文注释
  const char ch = '\\';
  char s2[] = "//\\";
  int x = 0x01;
  float a = 0.302;
  float b = -128.101;
  double c = 123;
  float d = 112.64E3;
  double e = 0.7623e-2;
  float f = 1.23002398;
  printf ("a=%e \nb=%f \nc=%lf \nd=%lE \ne=%lf \nf=%f\n", a, b, c, d, e, f);
  return 0;
}

Output:

[2021-10-06 15:59:00.069] [info] Options loaded
[2021-10-06 15:59:00.070] [info] Started
[2021-10-06 15:59:00.071] [info] File is open
[2021-10-06 15:59:00.072] [info] Start scanning
[2021-10-06 15:59:00.072] [info] sample.c:1:18  TYPE=PREPROC TEXT=#include <stdio.h>
[2021-10-06 15:59:00.077] [info] sample.c:2:19  TYPE=PREPROC TEXT=#include <stdlib.h>
[2021-10-06 15:59:00.078] [info] sample.c:3:3   TYPE=INT     TEXT=int         
[2021-10-06 15:59:00.079] [info] sample.c:4:4   TYPE=ID      TEXT=main        
[2021-10-06 15:59:00.080] [info] sample.c:4:6   TYPE=PAR_L   TEXT=(           
[2021-10-06 15:59:00.081] [info] sample.c:4:7   TYPE=PAR_R   TEXT=)           
[2021-10-06 15:59:00.082] [info] sample.c:5:1   TYPE=CUR_L   TEXT={           
[2021-10-06 15:59:00.083] [info] sample.c:6:18  Inline comment: // Master spark!
[2021-10-06 15:59:00.084] [info] sample.c:10:6          Block comment: 
/**
   * lalalalal
   *  ffffffffffffff
   */
[2021-10-06 15:59:00.084] [info] sample.c:11:7          TYPE=CONST   TEXT=const       
[2021-10-06 15:59:00.085] [info] sample.c:11:12         TYPE=CHAR    TEXT=char        
[2021-10-06 15:59:00.085] [info] sample.c:11:14         TYPE=STAR    TEXT=*           
[2021-10-06 15:59:00.085] [info] sample.c:11:18         TYPE=ID      TEXT=str         
[2021-10-06 15:59:00.086] [info] sample.c:11:20         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.087] [info] sample.c:11:58         TYPE=STR     STR="Hello, /*This is a test*/\r\n\t\"\""
[2021-10-06 15:59:00.087] [info] sample.c:11:59         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.087] [info] sample.c:12:7          TYPE=CONST   TEXT=const       
[2021-10-06 15:59:00.088] [info] sample.c:12:12         TYPE=CHAR    TEXT=char        
[2021-10-06 15:59:00.088] [info] sample.c:12:14         TYPE=STAR    TEXT=*           
[2021-10-06 15:59:00.088] [info] sample.c:12:21         TYPE=ID      TEXT=str_zh      
[2021-10-06 15:59:00.088] [info] sample.c:12:23         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.088] [info] sample.c:12:59         TYPE=STR     STR="你好,这是一个测试程序"
[2021-10-06 15:59:00.088] [info] sample.c:12:60         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.089] [info] sample.c:13:29         Inline comment: // 这是一句中文注释
[2021-10-06 15:59:00.089] [info] sample.c:14:7          TYPE=CONST   TEXT=const       
[2021-10-06 15:59:00.089] [info] sample.c:14:12         TYPE=CHAR    TEXT=char        
[2021-10-06 15:59:00.089] [info] sample.c:14:15         TYPE=ID      TEXT=ch          
[2021-10-06 15:59:00.089] [info] sample.c:14:17         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.089] [info] sample.c:14:22         TYPE=CHR     TEXT=            
[2021-10-06 15:59:00.089] [info] sample.c:14:23         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.089] [info] sample.c:15:6          TYPE=CHAR    TEXT=char        
[2021-10-06 15:59:00.089] [info] sample.c:15:9          TYPE=ID      TEXT=s2          
[2021-10-06 15:59:00.090] [info] sample.c:15:10         TYPE=BRA_L   TEXT=[           
[2021-10-06 15:59:00.090] [info] sample.c:15:11         TYPE=BRA_R   TEXT=]           
[2021-10-06 15:59:00.090] [info] sample.c:15:13         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.090] [info] sample.c:15:20         TYPE=STR     STR="//\\"      
[2021-10-06 15:59:00.090] [info] sample.c:15:21         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.090] [info] sample.c:16:5          TYPE=INT     TEXT=int         
[2021-10-06 15:59:00.090] [info] sample.c:16:7          TYPE=ID      TEXT=x           
[2021-10-06 15:59:00.090] [info] sample.c:16:9          TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.090] [info] sample.c:16:14         TYPE=NUM     TEXT=0x01        
[2021-10-06 15:59:00.091] [info] sample.c:16:15         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.091] [info] sample.c:17:7          TYPE=FLOAT   TEXT=float       
[2021-10-06 15:59:00.091] [info] sample.c:17:9          TYPE=ID      TEXT=a           
[2021-10-06 15:59:00.091] [info] sample.c:17:11         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.091] [info] sample.c:17:17         TYPE=NUM     TEXT=0.302       
[2021-10-06 15:59:00.092] [info] sample.c:17:18         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.092] [info] sample.c:18:7          TYPE=FLOAT   TEXT=float       
[2021-10-06 15:59:00.092] [info] sample.c:18:9          TYPE=ID      TEXT=b           
[2021-10-06 15:59:00.092] [info] sample.c:18:11         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.093] [info] sample.c:18:13         TYPE=MINUS   TEXT=-           
[2021-10-06 15:59:00.093] [info] sample.c:18:20         TYPE=NUM     TEXT=128.101     
[2021-10-06 15:59:00.093] [info] sample.c:18:21         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.093] [info] sample.c:19:8          TYPE=DOUBLE  TEXT=double      
[2021-10-06 15:59:00.093] [info] sample.c:19:10         TYPE=ID      TEXT=c           
[2021-10-06 15:59:00.093] [info] sample.c:19:12         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.093] [info] sample.c:19:16         TYPE=NUM     TEXT=123         
[2021-10-06 15:59:00.093] [info] sample.c:19:17         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.093] [info] sample.c:20:7          TYPE=FLOAT   TEXT=float       
[2021-10-06 15:59:00.094] [info] sample.c:20:9          TYPE=ID      TEXT=d           
[2021-10-06 15:59:00.094] [info] sample.c:20:11         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.094] [info] sample.c:20:20         TYPE=NUM     TEXT=112.64E3    
[2021-10-06 15:59:00.094] [info] sample.c:20:21         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.094] [info] sample.c:21:8          TYPE=DOUBLE  TEXT=double      
[2021-10-06 15:59:00.094] [info] sample.c:21:10         TYPE=ID      TEXT=e           
[2021-10-06 15:59:00.094] [info] sample.c:21:12         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.095] [info] sample.c:21:22         TYPE=NUM     TEXT=0.7623e-2   
[2021-10-06 15:59:00.095] [info] sample.c:21:23         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.095] [info] sample.c:22:7          TYPE=FLOAT   TEXT=float       
[2021-10-06 15:59:00.095] [info] sample.c:22:9          TYPE=ID      TEXT=f           
[2021-10-06 15:59:00.095] [info] sample.c:22:11         TYPE=EQUAL   TEXT==           
[2021-10-06 15:59:00.095] [info] sample.c:22:22         TYPE=NUM     TEXT=1.23002398  
[2021-10-06 15:59:00.095] [info] sample.c:22:23         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.095] [info] sample.c:23:8          TYPE=ID      TEXT=printf      
[2021-10-06 15:59:00.095] [info] sample.c:23:10         TYPE=PAR_L   TEXT=(           
[2021-10-06 15:59:00.096] [info] sample.c:23:56         TYPE=STR     STR="a=%e \nb=%f \nc=%lf \nd=%lE \ne=%lf \nf=%f\n"
[2021-10-06 15:59:00.096] [info] sample.c:23:57         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.096] [info] sample.c:23:59         TYPE=ID      TEXT=a           
[2021-10-06 15:59:00.096] [info] sample.c:23:60         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.096] [info] sample.c:23:62         TYPE=ID      TEXT=b           
[2021-10-06 15:59:00.098] [info] sample.c:23:63         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.098] [info] sample.c:23:65         TYPE=ID      TEXT=c           
[2021-10-06 15:59:00.098] [info] sample.c:23:66         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.098] [info] sample.c:23:68         TYPE=ID      TEXT=d           
[2021-10-06 15:59:00.099] [info] sample.c:23:69         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.099] [info] sample.c:23:71         TYPE=ID      TEXT=e           
[2021-10-06 15:59:00.099] [info] sample.c:23:72         TYPE=COMMA   TEXT=,           
[2021-10-06 15:59:00.099] [info] sample.c:23:74         TYPE=ID      TEXT=f           
[2021-10-06 15:59:00.099] [info] sample.c:23:75         TYPE=PAR_R   TEXT=)           
[2021-10-06 15:59:00.099] [info] sample.c:23:76         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.099] [info] sample.c:24:8          TYPE=RETURN  TEXT=return      
[2021-10-06 15:59:00.099] [info] sample.c:24:10         TYPE=NUM     TEXT=0           
[2021-10-06 15:59:00.100] [info] sample.c:24:11         TYPE=SEMIC   TEXT=;           
[2021-10-06 15:59:00.100] [info] sample.c:25:1          TYPE=CUR_R   TEXT=}           
[2021-10-06 15:59:00.100] [info] EOF
[2021-10-06 15:59:00.100] [info] Scan over

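Unterminated String, Invalid Character, Invalid Comment, Valid Comment, and Line Continuation Tests

  • bad1.c tests unterminated strings, empty character literals, and line continuations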

dist/bad1.c

//bad1.c 测试未结尾串、空字符、续行符
const char * s0 = "string is here 
but not terminated";

const char * s1 = "another string is here \
but not terminated";

int c = ''

typedef (const char *) fixed_str;

fixed_str "你好,世界!"

/*
 Unterminated block comment test

Result:

pluveto@devhost1:~/bupt-c-lexer/dist$ ./lb_lexer bad1.c 
[2021-10-06 15:54:39.922] [info] Options loaded
[2021-10-06 15:54:39.923] [info] Started
[2021-10-06 15:54:39.923] [info] File is open
[2021-10-06 15:54:39.924] [info] Start scanning
[2021-10-06 15:54:39.924] [info] bad1.c:1:52    Inline comment: //bad1.c 测试未结尾串、空字符、续行符
[2021-10-06 15:54:39.925] [info] bad1.c:2:5     TYPE=CONST   TEXT=const       
[2021-10-06 15:54:39.926] [info] bad1.c:2:10    TYPE=CHAR    TEXT=char        
[2021-10-06 15:54:39.927] [info] bad1.c:2:12    TYPE=STAR    TEXT=*           
[2021-10-06 15:54:39.928] [info] bad1.c:2:15    TYPE=ID      TEXT=s0          
[2021-10-06 15:54:39.928] [info] bad1.c:2:17    TYPE=EQUAL   TEXT==           
[2021-10-06 15:54:39.929] [error] Lexical error: Missing closing quote (bad1.c:2:34,87)
[2021-10-06 15:54:39.929] [error] 2 | const char * s0 = "string is here 
[2021-10-06 15:54:39.929] [error]                                      ^ here
[2021-10-06 15:54:39.929] [info] bad1.c:3:20    TYPE=STR     STR="string is here 
but not terminated"
[2021-10-06 15:54:39.929] [info] bad1.c:3:21    TYPE=SEMIC   TEXT=;           
[2021-10-06 15:54:39.929] [info] bad1.c:5:5     TYPE=CONST   TEXT=const       
[2021-10-06 15:54:39.930] [info] bad1.c:5:10    TYPE=CHAR    TEXT=char        
[2021-10-06 15:54:39.930] [info] bad1.c:5:12    TYPE=STAR    TEXT=*           
[2021-10-06 15:54:39.930] [info] bad1.c:5:15    TYPE=ID      TEXT=s1          
[2021-10-06 15:54:39.930] [info] bad1.c:5:17    TYPE=EQUAL   TEXT==           
[2021-10-06 15:54:39.930] [info] bad1.c:6:21    TYPE=STR     STR="another string is here \
but not terminated"
[2021-10-06 15:54:39.930] [info] bad1.c:6:22    TYPE=SEMIC   TEXT=;           
[2021-10-06 15:54:39.930] [info] bad1.c:8:3     TYPE=INT     TEXT=int         
[2021-10-06 15:54:39.931] [info] bad1.c:8:5     TYPE=ID      TEXT=c           
[2021-10-06 15:54:39.931] [info] bad1.c:8:7     TYPE=EQUAL   TEXT==           
[2021-10-06 15:54:39.931] [error] Lexical error: Expect char literal, nothing given (bad1.c:8:9,185)
[2021-10-06 15:54:39.931] [error] 8 | int c = ''
[2021-10-06 15:54:39.931] [error]             ^ here
[2021-10-06 15:54:39.931] [info] bad1.c:8:10    TYPE=CHR     TEXT=            
[2021-10-06 15:54:39.932] [info] bad1.c:10:7    TYPE=TYPEDEF TEXT=typedef     
[2021-10-06 15:54:39.932] [info] bad1.c:10:9    TYPE=PAR_L   TEXT=(           
[2021-10-06 15:54:39.932] [info] bad1.c:10:14   TYPE=CONST   TEXT=const       
[2021-10-06 15:54:39.932] [info] bad1.c:10:19   TYPE=CHAR    TEXT=char        
[2021-10-06 15:54:39.933] [info] bad1.c:10:21   TYPE=STAR    TEXT=*           
[2021-10-06 15:54:39.933] [info] bad1.c:10:22   TYPE=PAR_R   TEXT=)           
[2021-10-06 15:54:39.933] [info] bad1.c:10:32   TYPE=ID      TEXT=fixed_str   
[2021-10-06 15:54:39.933] [info] bad1.c:10:33   TYPE=SEMIC   TEXT=;           
[2021-10-06 15:54:39.933] [info] bad1.c:12:9    TYPE=ID      TEXT=fixed_str   
[2021-10-06 15:54:39.933] [info] bad1.c:12:30   TYPE=STR     STR="你好,世界!"
[2021-10-06 15:54:39.934] [error] Lexical error: Unterminated block comment (bad1.c:15:33,291)
[2021-10-06 15:54:39.934] [error] 15 |  Unterminated block comment test
[2021-10-06 15:54:39.934] [error]                                      ^ here
[2021-10-06 15:54:39.934] [info] bad1.c:15:36   Block comment: 
/*
 Unterminated block comment test
[2021-10-06 15:54:39.934] [info] Scan over

Escape Sequence and Invalid Escape Sequence Tests

Input:

dist/bad2.c

#include <stdio.h>
int main(int argc, char const *argv[])
{
    // 有效转义
    printf ("测试 \u1234\r\n\0");
    // 无效转义
    printf ("\BU\P\T\ 北京 \ 邮电")
    return 0;
}

Output:

[2021-10-06 15:57:16.906] [info] Options loaded
[2021-10-06 15:57:16.906] [info] Started
[2021-10-06 15:57:16.907] [info] File is open
[2021-10-06 15:57:16.907] [info] Start scanning
[2021-10-06 15:57:16.908] [info] bad2.c:1:18    TYPE=PREPROC TEXT=#include <stdio.h>
[2021-10-06 15:57:16.909] [info] bad2.c:2:3     TYPE=INT     TEXT=int         
[2021-10-06 15:57:16.910] [info] bad2.c:2:8     TYPE=ID      TEXT=main        
[2021-10-06 15:57:16.911] [info] bad2.c:2:9     TYPE=PAR_L   TEXT=(           
[2021-10-06 15:57:16.912] [info] bad2.c:2:12    TYPE=INT     TEXT=int         
[2021-10-06 15:57:16.913] [info] bad2.c:2:17    TYPE=ID      TEXT=argc        
[2021-10-06 15:57:16.915] [info] bad2.c:2:18    TYPE=COMMA   TEXT=,           
[2021-10-06 15:57:16.916] [info] bad2.c:2:23    TYPE=CHAR    TEXT=char        
[2021-10-06 15:57:16.916] [info] bad2.c:2:29    TYPE=CONST   TEXT=const       
[2021-10-06 15:57:16.917] [info] bad2.c:2:31    TYPE=STAR    TEXT=*           
[2021-10-06 15:57:16.917] [info] bad2.c:2:35    TYPE=ID      TEXT=argv        
[2021-10-06 15:57:16.917] [info] bad2.c:2:36    TYPE=BRA_L   TEXT=[           
[2021-10-06 15:57:16.917] [info] bad2.c:2:37    TYPE=BRA_R   TEXT=]           
[2021-10-06 15:57:16.918] [info] bad2.c:2:38    TYPE=PAR_R   TEXT=)           
[2021-10-06 15:57:16.918] [info] bad2.c:3:1     TYPE=CUR_L   TEXT={           
[2021-10-06 15:57:16.918] [info] bad2.c:4:19    Inline comment: // 有效转义
[2021-10-06 15:57:16.918] [info] bad2.c:5:10    TYPE=ID      TEXT=printf      
[2021-10-06 15:57:16.918] [info] bad2.c:5:11    TYPE=PAR_L   TEXT=(           
[2021-10-06 15:57:16.918] [info] bad2.c:5:31    TYPE=STR     STR="测试 \u1234\r\n\0"
[2021-10-06 15:57:16.919] [info] bad2.c:5:32    TYPE=PAR_R   TEXT=)           
[2021-10-06 15:57:16.919] [info] bad2.c:5:33    TYPE=SEMIC   TEXT=;           
[2021-10-06 15:57:16.919] [info] bad2.c:6:19    Inline comment: // 无效转义
[2021-10-06 15:57:16.919] [info] bad2.c:7:10    TYPE=ID      TEXT=printf      
[2021-10-06 15:57:16.919] [info] bad2.c:7:11    TYPE=PAR_L   TEXT=(           
[2021-10-06 15:57:16.919] [error] Lexical error: Invalid escape char (bad2.c:7:12,146)
[2021-10-06 15:57:16.920] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:15,149)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                   ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:17,151)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                     ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:19,153)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                       ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:26,160)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                              ^ here
[2021-10-06 15:57:16.923] [info] bad2.c:7:34    TYPE=STR     STR="\BU\P\T\ 北京 \ 邮电"
[2021-10-06 15:57:16.923] [info] bad2.c:7:35    TYPE=PAR_R   TEXT=)           
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:36,170)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                                        ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:37,171)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                                         ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:38,172)
[2021-10-06 15:57:16.923] [error] 7 |     printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error]                                          ^ here
[2021-10-06 15:57:16.923] [info] bad2.c:8:10    TYPE=RETURN  TEXT=return      
[2021-10-06 15:57:16.923] [info] bad2.c:8:12    TYPE=NUM     TEXT=0           
[2021-10-06 15:57:16.923] [info] bad2.c:8:13    TYPE=SEMIC   TEXT=;           
[2021-10-06 15:57:16.923] [info] bad2.c:9:1     TYPE=CUR_R   TEXT=}           
[2021-10-06 15:57:16.923] [info] EOF
[2021-10-06 15:57:16.923] [info] Scan over

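Complex Case Test (using the Linux ls source code)

Input:

dist/sample2.c


Output:

Very Complex Case Test (btree.c, about 10,000 lines)

The output is about four million lines, too long to include here; see the attachment.

Test Conclusion

The program runs well.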