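C Lexer: Design and Implementation
Assignment
Design and implementation of a lexical analyzer for the C language
Requirements
- Recognize every lexeme in a source program written in C and output each one as a token.
- Recognize and skip comments in the source program.
- Count the number of statement lines, the number of lexemes in each category, and the total number of characters in the source program, and output the statistics.
- Detect lexical errors in the source program and report where each error occurs.
- Recover appropriately from errors so that lexical analysis can continue; a single scan of the source program should detect and report all of its lexical errors.
Implementation requirements
Implement the analyzer with both of the following methods.
Method 1: Write the lexical analyzer by hand, using C/C++ as the implementation language. (Required)
Method 2: Write a LEX source program and use the LEX compiler to generate the lexical analyzer automatically.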
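Program design
Introduction to lexical analysis
Lexical analysis, also called scanning, reads the input as a stream of characters and groups them into meaningful sequences called lexemes. For each lexeme, the lexical analyzer outputs a tuple of the form <token-name, attribute-value>, which is called a token.
The token name is an abstract symbol used by the next phase, syntax analysis; the attribute value points to an entry in the symbol table.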
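To make the tuple concrete, here is a minimal sketch of a token as a name plus an attribute value. The names `token_name`, `token`, and `format_token` are hypothetical and exist only for this illustration; the attribute is assumed to be an index into a symbol table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical token names; the lexer described here defines its own enum.
enum class token_name { ID, NUM, PLUS };

// A token as a <token-name, attribute-value> pair.
// The attribute is assumed to be an index into a symbol table.
struct token
{
  token_name name;
  std::size_t attribute;
};

// Formats a token in the conventional <name, value> notation.
inline std::string
format_token (const token &t)
{
  const char *names[] = { "ID", "NUM", "PLUS" };
  return "<" + std::string (names[static_cast<int> (t.name)]) + ", "
         + std::to_string (t.attribute) + ">";
}
```

For example, `format_token (token{ token_name::ID, 3 })` describes an identifier whose symbol-table entry is number 3.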
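Features of this program
- Reports line and column numbers and supports CJK (Chinese, Japanese, Korean) characters.
- Reads source files of arbitrary size. It was tested on the btree implementation in sqlite, roughly ten thousand lines. This works because the source is read through a file input stream rather than loaded into memory as a whole.
- Reads, analyzes, and outputs at the same time, implemented with callback functions. Internally it uses std::istream and does not depend on file streams, so in principle it could run lexical analysis on code read from a network stream.
- Recovers from errors and can locate multiple lexical errors in one pass; the test report includes sample error output.
- Locates preprocessor directives, comments, and similar content.
- Supports floating-point, exponent, hexadecimal, and other numeric forms.
- Handles escape characters, line continuations, and other special cases, and can tell whether an escape sequence is valid.
- Friendly error messages that print the offending line and mark the error position.
- Supports multiple log levels, configurable through an environment variable.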
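The streaming, callback-driven design mentioned in the list can be sketched as follows. This is not the project's scanner: `scan_words` and `collect_words` are hypothetical names, and the "tokens" are just whitespace-separated words. It only shows how reading from a `std::istream` and emitting through a callback avoids holding the whole input in memory.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical mini-scanner: reads one character at a time from any
// std::istream and hands each completed word to the callback immediately.
inline void
scan_words (std::istream &in,
            const std::function<void (const std::string &)> &on_token)
{
  std::string word;
  char c;
  while (in.get (c))
    {
      if (c == ' ' || c == '\n' || c == '\t' || c == '\r')
        {
          if (!word.empty ())
            {
              on_token (word);
              word.clear ();
            }
        }
      else
        word.push_back (c);
    }
  if (!word.empty ())
    on_token (word); // flush the last word at end of input
}

// Usage helper: scan an in-memory string and collect the words.
inline std::vector<std::string>
collect_words (const std::string &source)
{
  std::istringstream in (source);
  std::vector<std::string> out;
  scan_words (in, [&out] (const std::string &w) { out.push_back (w); });
  return out;
}
```

Because `scan_words` only sees a `std::istream &`, the same function accepts a `std::ifstream`, a `std::istringstream`, or any custom stream buffer.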
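Design approach
The basic approach to lexing C code is to read the file stream one character at a time and, depending on the character, enter the branch that matches the corresponding kind of token.
Token types
First define the token types. The keyword list follows "C keywords" on cppreference.com:
include/token_type.h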
namespace lb_lexer
{
enum token_type
{
PAR_L, // (
PAR_R, // )
BRA_L, // [
BRA_R, // ]
CUR_L, // {
CUR_R, // }
COMMA, // ,
DOT, // .
PLUS, // +
SEMIC, // ;
SLASH, // /
STAR, // *
COLON, // :
PERCT, // %
QUEST, // ?
INLCOM, // inline comment //
BLKCOM, // block comment /* ... */
PREPROC, // preprocessor directive
MINUS, // -
PTMEM, // -> member access through a pointer
BANG, // !
BANGE, // !=
EQUAL, // =
EQUALE, // ==
GREAT, // >
GREATE, // >=
LESS, // <
LESSE, // <=
CARET, // ^
CARETE, // ^=
TILDE, // ~
TILDEE, // ~= (not a standard C operator)
AND, // &
AAND, // &&
OR, // |
OOR, // ||
PLUSE, // +=
MINUSE, // -=
MULTPE, // *=
DIVE, // /=
// Identifiers and literals
ID, // identifier
CHR, // char 'x'
STR, // string
NUM, // number
// Keywords
AUTO,
BREAK,
CASE,
CHAR,
CONST,
CONTINUE,
DEFAULT,
DO,
DOUBLE,
ELSE,
ENUM,
EXTERN,
FLOAT,
FOR,
GOTO,
IF,
INLINE,
INT,
LONG,
REGISTER,
RESTRICT,
RETURN,
SHORT,
SIGNED,
SIZEOF,
STATIC,
STRUCT,
SWITCH,
TYPEDEF,
UNION,
UNSIGNED,
VOID,
VOLATILE,
WHILE,
_ALIGNAS,
_ALIGNOF,
_ATOMIC,
_BOOL,
_COMPLEX,
_DECIMAL128,
_DECIMAL32,
_DECIMAL64,
_GENERIC,
_IMAGINARY,
_NORETURN,
_STATIC_ASSERT,
_THREAD_LOCAL,
END_OF_FILE,
LINE_BREAK,
UNKNOWN
};
}
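Since keywords and identifiers share the same lexical shape, a common way to tell them apart is to scan a whole word first and then look it up in a keyword table. The sketch below assumes that approach; `kw_token` and `classify_word` are hypothetical names, and only three keywords are listed to keep it short.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical subset of token types, enough to show the lookup.
enum class kw_token { ID, INT, RETURN, WHILE };

// Returns the keyword token for a scanned word, or ID when the word is
// not a keyword. C keywords are case-sensitive, so the lookup is exact.
inline kw_token
classify_word (const std::string &word)
{
  static const std::unordered_map<std::string, kw_token> keywords = {
    { "int", kw_token::INT },
    { "return", kw_token::RETURN },
    { "while", kw_token::WHILE },
  };
  auto it = keywords.find (word);
  return it == keywords.end () ? kw_token::ID : it->second;
}
```

With this table, `classify_word ("int")` yields the keyword token, while `classify_word ("main")` and `classify_word ("Int")` both fall back to `ID`.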
Error types
The following error types are defined:
include/scanner.h
enum lexical_etype
{
UNEXPECTED_CHAR,
UNTERMINATED_STRING,
UNTERMINATED_CHAR,
EMPTY_CHAR_LITERAL,
UNTERMINATED_BLOCK_COMMENT,
INVALID_ESCAPE_CHAR
};
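The error types can be turned into the messages seen in the test logs with a small lookup. The sketch below is an assumption about how that might be done, not the project's code: `describe` and `format_error` are hypothetical helpers, the pairing of messages with enumerators is inferred from the log output, the message for `UNTERMINATED_CHAR` does not appear in the logs and is made up, and the last number in the location is assumed to be the byte offset.

```cpp
#include <cstddef>
#include <string>

// Same enumerators as include/scanner.h, repeated so the sketch compiles alone.
enum lexical_etype
{
  UNEXPECTED_CHAR,
  UNTERMINATED_STRING,
  UNTERMINATED_CHAR,
  EMPTY_CHAR_LITERAL,
  UNTERMINATED_BLOCK_COMMENT,
  INVALID_ESCAPE_CHAR
};

// Hypothetical helper: maps an error type to a human-readable message.
inline const char *
describe (lexical_etype e)
{
  switch (e)
    {
    case UNEXPECTED_CHAR: return "Unexpected char";
    case UNTERMINATED_STRING: return "Missing closing quote";
    case UNTERMINATED_CHAR: return "Unterminated char literal"; // made-up text
    case EMPTY_CHAR_LITERAL: return "Expect char literal, nothing given";
    case UNTERMINATED_BLOCK_COMMENT: return "Unterminated block comment";
    case INVALID_ESCAPE_CHAR: return "Invalid escape char";
    }
  return "Unknown lexical error";
}

// Builds "Lexical error: <message> (<file>:<line>:<col>,<pos>)".
inline std::string
format_error (const std::string &file, std::size_t line, std::size_t col,
              std::size_t pos, lexical_etype e)
{
  return std::string ("Lexical error: ") + describe (e) + " (" + file + ":"
         + std::to_string (line) + ":" + std::to_string (col) + ","
         + std::to_string (pos) + ")";
}
```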
Lexical analysis
The core is a single switch statement:
src/scanner.cpp
void
scanner::scan_token ()
{
auto c = advance ();
switch (c)
{
case EOF:
spdlog::info ("EOF");
finished = true;
break;
case '#':
match_preproc ();
break;
case '(':
fast_yield (token_type::PAR_L);
break;
case ')':
fast_yield (token_type::PAR_R);
break;
case '{':
fast_yield (token_type::CUR_L);
break;
case '}':
fast_yield (token_type::CUR_R);
break;
case '[':
fast_yield (token_type::BRA_L);
break;
case ']':
fast_yield (token_type::BRA_R);
break;
case ',':
fast_yield (token_type::COMMA);
break;
case '.':
fast_yield (token_type::DOT);
break;
case '+':
fast_yield (token_type::PLUS);
break;
case ':':
fast_yield (token_type::COLON);
break;
case ';':
fast_yield (token_type::SEMIC);
break;
case '*':
fast_yield (token_type::STAR);
break;
case '%':
fast_yield (token_type::PERCT);
break;
case '?':
fast_yield (token_type::QUEST);
break;
case '-':
fast_yield (match ('>') ? token_type::PTMEM : token_type::MINUS);
break;
case '&':
fast_yield (match ('&') ? token_type::AAND : token_type::AND);
break;
case '|':
fast_yield (match ('|') ? token_type::OOR : token_type::OR);
break;
case '!':
fast_yield (match ('=') ? token_type::BANGE : token_type::BANG);
break;
case '=':
fast_yield (match ('=') ? token_type::EQUALE : token_type::EQUAL);
break;
case '<':
fast_yield (match ('=') ? token_type::LESSE : token_type::LESS);
break;
case '>':
fast_yield (match ('=') ? token_type::GREATE : token_type::GREAT);
break;
case '~':
// TILDEE covers `~=`, which is not a standard C operator.
fast_yield (match ('=') ? token_type::TILDEE : token_type::TILDE);
break;
case '^':
fast_yield (match ('=') ? token_type::CARETE : token_type::CARET);
break;
case '/':
if (match ('/'))
{
match_inline_comment ();
}
else if (match ('*'))
{
match_block_comment ();
}
else
{
fast_yield (token_type::SLASH);
}
break;
case ' ':
case '\r':
case '\t':
// Ignore whitespace.
break;
case '\n':
col = 0;
line++;
break;
case '\'':
match_char ();
break;
case '"':
match_string ();
break;
default:
if (is_digit (c))
match_number ();
else if (is_word (c))
match_identifier ();
else
{
if (on_error (lexical_etype::UNEXPECTED_CHAR))
{
// try recover
}
else
{
throw std::exception ();
}
}
break;
}
}
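The two-character operators in the switch rely on a conditional-consume helper: `match` advances only when the next character is the expected one, so the longest operator wins. The sketch below illustrates that idea on an in-memory buffer. `cursor` and `scan_operator` are hypothetical names, and it also shows how `+=`, `-=`, and `*=` could be recognized the same way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical cursor over an in-memory buffer.
struct cursor
{
  std::string src;
  std::size_t current = 0;

  bool at_end () const { return current >= src.size (); }

  // Consumes and returns the next character; the caller checks at_end first.
  char advance () { return src[current++]; }

  // Consumes the next character only if it equals `expected`.
  bool match (char expected)
  {
    if (at_end () || src[current] != expected)
      return false;
    ++current;
    return true;
  }
};

// Returns the longest operator that starts at the cursor position.
inline std::string
scan_operator (cursor &c)
{
  char first = c.advance ();
  switch (first)
    {
    case '+': return c.match ('=') ? "+=" : "+";
    case '-':
      if (c.match ('>'))
        return "->";
      return c.match ('=') ? "-=" : "-";
    case '*': return c.match ('=') ? "*=" : "*";
    default: return std::string (1, first);
    }
}
```

On the input `->+=-`, three successive calls return `->`, `+=`, and `-`.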
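More complex tokens are handled by helper routines:
include/scanner.h
// Match an identifier
void match_identifier ();
// Match a number
void match_number ();
// Match a character literal
void match_char ();
// Match a string literal
void match_string ();
// Match an inline comment
void match_inline_comment ();
// Match a block comment
void match_block_comment ();
// Match a preprocessor directive
void match_preproc ();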
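As an illustration of what a number-matching routine has to cover, the sketch below measures the length of a numeric literal with an optional fraction, an optional exponent, or a hexadecimal prefix. It is a simplified assumption, not the body of `match_number`: the name `number_length` is hypothetical, and suffixes such as `u`, `l`, or `f` are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the length of the numeric literal starting at s[start].
// Covers decimal integers, fractions, exponents, and 0x hexadecimal.
inline std::size_t
number_length (const std::string &s, std::size_t start)
{
  auto digit = [&s] (std::size_t k) {
    return k < s.size () && std::isdigit (static_cast<unsigned char> (s[k]));
  };
  auto xdigit = [&s] (std::size_t k) {
    return k < s.size () && std::isxdigit (static_cast<unsigned char> (s[k]));
  };

  std::size_t i = start;
  // Hexadecimal: 0x or 0X followed by at least one hex digit.
  if (i + 1 < s.size () && s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X')
      && xdigit (i + 2))
    {
      i += 2;
      while (xdigit (i))
        ++i;
      return i - start;
    }
  while (digit (i)) // integer part
    ++i;
  if (i < s.size () && s[i] == '.' && digit (i + 1)) // fraction
    {
      ++i;
      while (digit (i))
        ++i;
    }
  if (i < s.size () && (s[i] == 'e' || s[i] == 'E')) // exponent
    {
      std::size_t j = i + 1;
      if (j < s.size () && (s[j] == '+' || s[j] == '-'))
        ++j;
      if (digit (j)) // consume the exponent only if digits follow
        {
          i = j;
          while (digit (i))
            ++i;
        }
    }
  return i - start;
}
```

The exponent is consumed only when at least one digit follows it, so an input like `1e+` yields a literal of length 1 and leaves `e+` for later tokens.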
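Position tracking relies on the following variables:
include/scanner.h
// Whether the whole source has been read
bool finished;
// Start position in the buffer
size_t start;
// Current position in the buffer
size_t current;
// Current line
size_t line;
// Current column in bytes (a CJK character takes three bytes in UTF-8, so byte counts stay correct)
size_t col;
// Current column in characters (a CJK character counts as one)
size_t wcol;
// Number of bytes read so far, i.e. the current position
size_t pos;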
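The difference between the byte column and the character column can be shown with a short sketch. It assumes the input is valid UTF-8, where every byte of the form `10xxxxxx` is a continuation byte and every other byte starts a new code point. The names `columns` and `measure_columns` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Byte column (col) and character column (wcol) for one line of text.
struct columns
{
  std::size_t col;
  std::size_t wcol;
};

// Counts bytes and UTF-8 code points in a line (assumes valid UTF-8).
inline columns
measure_columns (const std::string &line)
{
  columns c{ 0, 0 };
  for (unsigned char byte : line)
    {
      ++c.col; // every byte advances the byte column
      if ((byte & 0xC0) != 0x80)
        ++c.wcol; // not a continuation byte, so a new character starts here
    }
  return c;
}
```

For a line holding two ASCII letters followed by two CJK characters, this gives `col == 8` and `wcol == 4`.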
Escape sequences are handled by the following routine:
src/scanner.cpp
void
scanner::handle_escape ()
{
if (peek () == '\\')
{
switch (peek_next ())
{
case '\n': // line continuation
new_line ();
advance ();
break;
/* octal */
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
advance ();
break;
// see https://en.wikipedia.org/wiki/Escape_sequences_in_C
case 'a': // Alert (bell, alarm)
case 'b': // Backspace
case 'e': // Escape character (non-standard extension, e.g. GCC)
case 'f': // Form feed (new page)
case 'n': // New-line
case 'r': // Carriage return
case 't': // Horizontal tab
case 'v': // Vertical tab
case '\'': // Single quotation mark
case '\"': // Double quotation mark
case '?': // Question mark
case '\\': // Backslash
case 'u': // Unicode code point below 10000 hexadecimal (added in C99)
case 'U': // Unicode code point given as eight hexadecimal digits
case 'x': // Hexadecimal escape, e.g. \x41
advance ();
break;
default: /* Escaped character like \ ^ : = */
if (!on_error (lexical_etype::INVALID_ESCAPE_CHAR))
{
throw std::exception ();
}
advance ();
break;
}
}
}
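Source code
See the attachment. The program is also open source: pluveto/lb_lexer (github.com).
Using the source code:
Setting up the development environment
Operating system: Debian 11 is used as the example; other distributions are similar.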
For Windows, refer to the separate setup guide.
Install system dependencies
$ sudo apt install cmake
Install the vcpkg package manager
$ mkdir ~/app
$ cd ~/app
$ git clone https://github.com/microsoft/vcpkg
$ ./vcpkg/bootstrap-vcpkg.sh
$ sudo ln -s "$(pwd)/vcpkg/vcpkg" /usr/bin/vcpkg
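Install dependencies
$ vcpkg install spdlog
Configure the VSCode environment
VSCode needs the C/C++, CMake, and CMake Tools extensions. (Installing the complete C/C++ Extension Pack is recommended.)
Append the following settings to .vscode/settings.json: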
{
"cmake.buildArgs": [ "-DCMAKE_TOOLCHAIN_FILE:FILEPATH=$VCPKG_PATH/scripts/buildsystems/vcpkg.cmake"],
"cmake.configureArgs": [ "-DCMAKE_TOOLCHAIN_FILE:FILEPATH=$VCPKG_PATH/scripts/buildsystems/vcpkg.cmake"],
"C_Cpp.default.includePath": [
"$VCPKG_PATH/installed/x64-linux/include",
],
"C_Cpp.default.configurationProvider": "ms-vscode.cmake-tools",
"cmake.configureOnOpen": true,
}
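Replace $VCPKG_PATH with the absolute path of your vcpkg directory, for example /home/pluveto/app/vcpkg.
Build and run
Run CMake Configure
Open the VSCode command palette (F1) and run CMake Configure.
Run the build script
mr is a ready-made script that builds and runs the program. Run it with ./mr.
If permission is denied, run
$ chmod +x mr
The program will then be built and run.
Executable
The program can be built with CMake. A prebuilt binary for Linux is included in the attachment.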
Test report
Strings, Chinese strings, hexadecimal, decimal fractions, and exponents
dist/sample.c
#include <stdio.h>
#include <stdlib.h>
int
main ()
{
// Master spark!
/**
* lalalalal
* ffffffffffffff
*/
const char * str = "Hello, /*This is a test*/\r\n\t\"\"";
const char * str_zh = "你好,这是一个测试程序";
// 这是一句中文注释
const char ch = '\\';
char s2[] = "//\\";
int x = 0x01;
float a = 0.302;
float b = -128.101;
double c = 123;
float d = 112.64E3;
double e = 0.7623e-2;
float f = 1.23002398;
printf ("a=%e \nb=%f \nc=%lf \nd=%lE \ne=%lf \nf=%f\n", a, b, c, d, e, f);
return 0;
}
Output:
[2021-10-06 15:59:00.069] [info] Options loaded
[2021-10-06 15:59:00.070] [info] Started
[2021-10-06 15:59:00.071] [info] File is open
[2021-10-06 15:59:00.072] [info] Start scanning
[2021-10-06 15:59:00.072] [info] sample.c:1:18 TYPE=PREPROC TEXT=#include <stdio.h>
[2021-10-06 15:59:00.077] [info] sample.c:2:19 TYPE=PREPROC TEXT=#include <stdlib.h>
[2021-10-06 15:59:00.078] [info] sample.c:3:3 TYPE=INT TEXT=int
[2021-10-06 15:59:00.079] [info] sample.c:4:4 TYPE=ID TEXT=main
[2021-10-06 15:59:00.080] [info] sample.c:4:6 TYPE=PAR_L TEXT=(
[2021-10-06 15:59:00.081] [info] sample.c:4:7 TYPE=PAR_R TEXT=)
[2021-10-06 15:59:00.082] [info] sample.c:5:1 TYPE=CUR_L TEXT={
[2021-10-06 15:59:00.083] [info] sample.c:6:18 Inline comment: // Master spark!
[2021-10-06 15:59:00.084] [info] sample.c:10:6 Block comment:
/**
* lalalalal
* ffffffffffffff
*/
[2021-10-06 15:59:00.084] [info] sample.c:11:7 TYPE=CONST TEXT=const
[2021-10-06 15:59:00.085] [info] sample.c:11:12 TYPE=CHAR TEXT=char
[2021-10-06 15:59:00.085] [info] sample.c:11:14 TYPE=STAR TEXT=*
[2021-10-06 15:59:00.085] [info] sample.c:11:18 TYPE=ID TEXT=str
[2021-10-06 15:59:00.086] [info] sample.c:11:20 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.087] [info] sample.c:11:58 TYPE=STR STR="Hello, /*This is a test*/\r\n\t\"\""
[2021-10-06 15:59:00.087] [info] sample.c:11:59 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.087] [info] sample.c:12:7 TYPE=CONST TEXT=const
[2021-10-06 15:59:00.088] [info] sample.c:12:12 TYPE=CHAR TEXT=char
[2021-10-06 15:59:00.088] [info] sample.c:12:14 TYPE=STAR TEXT=*
[2021-10-06 15:59:00.088] [info] sample.c:12:21 TYPE=ID TEXT=str_zh
[2021-10-06 15:59:00.088] [info] sample.c:12:23 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.088] [info] sample.c:12:59 TYPE=STR STR="你好,这是一个测试程序"
[2021-10-06 15:59:00.088] [info] sample.c:12:60 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.089] [info] sample.c:13:29 Inline comment: // 这是一句中文注释
[2021-10-06 15:59:00.089] [info] sample.c:14:7 TYPE=CONST TEXT=const
[2021-10-06 15:59:00.089] [info] sample.c:14:12 TYPE=CHAR TEXT=char
[2021-10-06 15:59:00.089] [info] sample.c:14:15 TYPE=ID TEXT=ch
[2021-10-06 15:59:00.089] [info] sample.c:14:17 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.089] [info] sample.c:14:22 TYPE=CHR TEXT=
[2021-10-06 15:59:00.089] [info] sample.c:14:23 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.089] [info] sample.c:15:6 TYPE=CHAR TEXT=char
[2021-10-06 15:59:00.089] [info] sample.c:15:9 TYPE=ID TEXT=s2
[2021-10-06 15:59:00.090] [info] sample.c:15:10 TYPE=BRA_L TEXT=[
[2021-10-06 15:59:00.090] [info] sample.c:15:11 TYPE=BRA_R TEXT=]
[2021-10-06 15:59:00.090] [info] sample.c:15:13 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.090] [info] sample.c:15:20 TYPE=STR STR="//\\"
[2021-10-06 15:59:00.090] [info] sample.c:15:21 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.090] [info] sample.c:16:5 TYPE=INT TEXT=int
[2021-10-06 15:59:00.090] [info] sample.c:16:7 TYPE=ID TEXT=x
[2021-10-06 15:59:00.090] [info] sample.c:16:9 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.090] [info] sample.c:16:14 TYPE=NUM TEXT=0x01
[2021-10-06 15:59:00.091] [info] sample.c:16:15 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.091] [info] sample.c:17:7 TYPE=FLOAT TEXT=float
[2021-10-06 15:59:00.091] [info] sample.c:17:9 TYPE=ID TEXT=a
[2021-10-06 15:59:00.091] [info] sample.c:17:11 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.091] [info] sample.c:17:17 TYPE=NUM TEXT=0.302
[2021-10-06 15:59:00.092] [info] sample.c:17:18 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.092] [info] sample.c:18:7 TYPE=FLOAT TEXT=float
[2021-10-06 15:59:00.092] [info] sample.c:18:9 TYPE=ID TEXT=b
[2021-10-06 15:59:00.092] [info] sample.c:18:11 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.093] [info] sample.c:18:13 TYPE=MINUS TEXT=-
[2021-10-06 15:59:00.093] [info] sample.c:18:20 TYPE=NUM TEXT=128.101
[2021-10-06 15:59:00.093] [info] sample.c:18:21 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.093] [info] sample.c:19:8 TYPE=DOUBLE TEXT=double
[2021-10-06 15:59:00.093] [info] sample.c:19:10 TYPE=ID TEXT=c
[2021-10-06 15:59:00.093] [info] sample.c:19:12 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.093] [info] sample.c:19:16 TYPE=NUM TEXT=123
[2021-10-06 15:59:00.093] [info] sample.c:19:17 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.093] [info] sample.c:20:7 TYPE=FLOAT TEXT=float
[2021-10-06 15:59:00.094] [info] sample.c:20:9 TYPE=ID TEXT=d
[2021-10-06 15:59:00.094] [info] sample.c:20:11 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.094] [info] sample.c:20:20 TYPE=NUM TEXT=112.64E3
[2021-10-06 15:59:00.094] [info] sample.c:20:21 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.094] [info] sample.c:21:8 TYPE=DOUBLE TEXT=double
[2021-10-06 15:59:00.094] [info] sample.c:21:10 TYPE=ID TEXT=e
[2021-10-06 15:59:00.094] [info] sample.c:21:12 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.095] [info] sample.c:21:22 TYPE=NUM TEXT=0.7623e-2
[2021-10-06 15:59:00.095] [info] sample.c:21:23 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.095] [info] sample.c:22:7 TYPE=FLOAT TEXT=float
[2021-10-06 15:59:00.095] [info] sample.c:22:9 TYPE=ID TEXT=f
[2021-10-06 15:59:00.095] [info] sample.c:22:11 TYPE=EQUAL TEXT==
[2021-10-06 15:59:00.095] [info] sample.c:22:22 TYPE=NUM TEXT=1.23002398
[2021-10-06 15:59:00.095] [info] sample.c:22:23 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.095] [info] sample.c:23:8 TYPE=ID TEXT=printf
[2021-10-06 15:59:00.095] [info] sample.c:23:10 TYPE=PAR_L TEXT=(
[2021-10-06 15:59:00.096] [info] sample.c:23:56 TYPE=STR STR="a=%e \nb=%f \nc=%lf \nd=%lE \ne=%lf \nf=%f\n"
[2021-10-06 15:59:00.096] [info] sample.c:23:57 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.096] [info] sample.c:23:59 TYPE=ID TEXT=a
[2021-10-06 15:59:00.096] [info] sample.c:23:60 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.096] [info] sample.c:23:62 TYPE=ID TEXT=b
[2021-10-06 15:59:00.098] [info] sample.c:23:63 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.098] [info] sample.c:23:65 TYPE=ID TEXT=c
[2021-10-06 15:59:00.098] [info] sample.c:23:66 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.098] [info] sample.c:23:68 TYPE=ID TEXT=d
[2021-10-06 15:59:00.099] [info] sample.c:23:69 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.099] [info] sample.c:23:71 TYPE=ID TEXT=e
[2021-10-06 15:59:00.099] [info] sample.c:23:72 TYPE=COMMA TEXT=,
[2021-10-06 15:59:00.099] [info] sample.c:23:74 TYPE=ID TEXT=f
[2021-10-06 15:59:00.099] [info] sample.c:23:75 TYPE=PAR_R TEXT=)
[2021-10-06 15:59:00.099] [info] sample.c:23:76 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.099] [info] sample.c:24:8 TYPE=RETURN TEXT=return
[2021-10-06 15:59:00.099] [info] sample.c:24:10 TYPE=NUM TEXT=0
[2021-10-06 15:59:00.100] [info] sample.c:24:11 TYPE=SEMIC TEXT=;
[2021-10-06 15:59:00.100] [info] sample.c:25:1 TYPE=CUR_R TEXT=}
[2021-10-06 15:59:00.100] [info] EOF
[2021-10-06 15:59:00.100] [info] Scan over
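Unterminated strings, invalid characters, invalid comments, valid comments, and line continuations
bad1.c
Tests unterminated strings, empty character literals, and line continuations
dist/bad1.c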
//bad1.c 测试未结尾串、空字符、续行符
const char * s0 = "string is here
but not terminated";
const char * s1 = "another string is here \
but not terminated";
int c = ''
typedef (const char *) fixed_str;
fixed_str "你好,世界!"
/*
Unterminated block comment test
Result:
pluveto@devhost1:~/bupt-c-lexer/dist$ ./lb_lexer bad1.c
[2021-10-06 15:54:39.922] [info] Options loaded
[2021-10-06 15:54:39.923] [info] Started
[2021-10-06 15:54:39.923] [info] File is open
[2021-10-06 15:54:39.924] [info] Start scanning
[2021-10-06 15:54:39.924] [info] bad1.c:1:52 Inline comment: //bad1.c 测试未结尾串、空字符、续行符
[2021-10-06 15:54:39.925] [info] bad1.c:2:5 TYPE=CONST TEXT=const
[2021-10-06 15:54:39.926] [info] bad1.c:2:10 TYPE=CHAR TEXT=char
[2021-10-06 15:54:39.927] [info] bad1.c:2:12 TYPE=STAR TEXT=*
[2021-10-06 15:54:39.928] [info] bad1.c:2:15 TYPE=ID TEXT=s0
[2021-10-06 15:54:39.928] [info] bad1.c:2:17 TYPE=EQUAL TEXT==
[2021-10-06 15:54:39.929] [error] Lexical error: Missing closing quote (bad1.c:2:34,87)
[2021-10-06 15:54:39.929] [error] 2 | const char * s0 = "string is here
[2021-10-06 15:54:39.929] [error] ^ here
[2021-10-06 15:54:39.929] [info] bad1.c:3:20 TYPE=STR STR="string is here
but not terminated"
[2021-10-06 15:54:39.929] [info] bad1.c:3:21 TYPE=SEMIC TEXT=;
[2021-10-06 15:54:39.929] [info] bad1.c:5:5 TYPE=CONST TEXT=const
[2021-10-06 15:54:39.930] [info] bad1.c:5:10 TYPE=CHAR TEXT=char
[2021-10-06 15:54:39.930] [info] bad1.c:5:12 TYPE=STAR TEXT=*
[2021-10-06 15:54:39.930] [info] bad1.c:5:15 TYPE=ID TEXT=s1
[2021-10-06 15:54:39.930] [info] bad1.c:5:17 TYPE=EQUAL TEXT==
[2021-10-06 15:54:39.930] [info] bad1.c:6:21 TYPE=STR STR="another string is here \
but not terminated"
[2021-10-06 15:54:39.930] [info] bad1.c:6:22 TYPE=SEMIC TEXT=;
[2021-10-06 15:54:39.930] [info] bad1.c:8:3 TYPE=INT TEXT=int
[2021-10-06 15:54:39.931] [info] bad1.c:8:5 TYPE=ID TEXT=c
[2021-10-06 15:54:39.931] [info] bad1.c:8:7 TYPE=EQUAL TEXT==
[2021-10-06 15:54:39.931] [error] Lexical error: Expect char literal, nothing given (bad1.c:8:9,185)
[2021-10-06 15:54:39.931] [error] 8 | int c = ''
[2021-10-06 15:54:39.931] [error] ^ here
[2021-10-06 15:54:39.931] [info] bad1.c:8:10 TYPE=CHR TEXT=
[2021-10-06 15:54:39.932] [info] bad1.c:10:7 TYPE=TYPEDEF TEXT=typedef
[2021-10-06 15:54:39.932] [info] bad1.c:10:9 TYPE=PAR_L TEXT=(
[2021-10-06 15:54:39.932] [info] bad1.c:10:14 TYPE=CONST TEXT=const
[2021-10-06 15:54:39.932] [info] bad1.c:10:19 TYPE=CHAR TEXT=char
[2021-10-06 15:54:39.933] [info] bad1.c:10:21 TYPE=STAR TEXT=*
[2021-10-06 15:54:39.933] [info] bad1.c:10:22 TYPE=PAR_R TEXT=)
[2021-10-06 15:54:39.933] [info] bad1.c:10:32 TYPE=ID TEXT=fixed_str
[2021-10-06 15:54:39.933] [info] bad1.c:10:33 TYPE=SEMIC TEXT=;
[2021-10-06 15:54:39.933] [info] bad1.c:12:9 TYPE=ID TEXT=fixed_str
[2021-10-06 15:54:39.933] [info] bad1.c:12:30 TYPE=STR STR="你好,世界!"
[2021-10-06 15:54:39.934] [error] Lexical error: Unterminated block comment (bad1.c:15:33,291)
[2021-10-06 15:54:39.934] [error] 15 | Unterminated block comment test
[2021-10-06 15:54:39.934] [error] ^ here
[2021-10-06 15:54:39.934] [info] bad1.c:15:36 Block comment:
/*
Unterminated block comment test
[2021-10-06 15:54:39.934] [info] Scan over
Escape sequences and invalid escape sequences
Input:
dist/bad2.c
#include <stdio.h>
int main(int argc, char const *argv[])
{
// 有效转义
printf ("测试 \u1234\r\n\0");
// 无效转义
printf ("\BU\P\T\ 北京 \ 邮电");
return 0;
}
Output:
[2021-10-06 15:57:16.906] [info] Options loaded
[2021-10-06 15:57:16.906] [info] Started
[2021-10-06 15:57:16.907] [info] File is open
[2021-10-06 15:57:16.907] [info] Start scanning
[2021-10-06 15:57:16.908] [info] bad2.c:1:18 TYPE=PREPROC TEXT=#include <stdio.h>
[2021-10-06 15:57:16.909] [info] bad2.c:2:3 TYPE=INT TEXT=int
[2021-10-06 15:57:16.910] [info] bad2.c:2:8 TYPE=ID TEXT=main
[2021-10-06 15:57:16.911] [info] bad2.c:2:9 TYPE=PAR_L TEXT=(
[2021-10-06 15:57:16.912] [info] bad2.c:2:12 TYPE=INT TEXT=int
[2021-10-06 15:57:16.913] [info] bad2.c:2:17 TYPE=ID TEXT=argc
[2021-10-06 15:57:16.915] [info] bad2.c:2:18 TYPE=COMMA TEXT=,
[2021-10-06 15:57:16.916] [info] bad2.c:2:23 TYPE=CHAR TEXT=char
[2021-10-06 15:57:16.916] [info] bad2.c:2:29 TYPE=CONST TEXT=const
[2021-10-06 15:57:16.917] [info] bad2.c:2:31 TYPE=STAR TEXT=*
[2021-10-06 15:57:16.917] [info] bad2.c:2:35 TYPE=ID TEXT=argv
[2021-10-06 15:57:16.917] [info] bad2.c:2:36 TYPE=BRA_L TEXT=[
[2021-10-06 15:57:16.917] [info] bad2.c:2:37 TYPE=BRA_R TEXT=]
[2021-10-06 15:57:16.918] [info] bad2.c:2:38 TYPE=PAR_R TEXT=)
[2021-10-06 15:57:16.918] [info] bad2.c:3:1 TYPE=CUR_L TEXT={
[2021-10-06 15:57:16.918] [info] bad2.c:4:19 Inline comment: // 有效转义
[2021-10-06 15:57:16.918] [info] bad2.c:5:10 TYPE=ID TEXT=printf
[2021-10-06 15:57:16.918] [info] bad2.c:5:11 TYPE=PAR_L TEXT=(
[2021-10-06 15:57:16.918] [info] bad2.c:5:31 TYPE=STR STR="测试 \u1234\r\n\0"
[2021-10-06 15:57:16.919] [info] bad2.c:5:32 TYPE=PAR_R TEXT=)
[2021-10-06 15:57:16.919] [info] bad2.c:5:33 TYPE=SEMIC TEXT=;
[2021-10-06 15:57:16.919] [info] bad2.c:6:19 Inline comment: // 无效转义
[2021-10-06 15:57:16.919] [info] bad2.c:7:10 TYPE=ID TEXT=printf
[2021-10-06 15:57:16.919] [info] bad2.c:7:11 TYPE=PAR_L TEXT=(
[2021-10-06 15:57:16.919] [error] Lexical error: Invalid escape char (bad2.c:7:12,146)
[2021-10-06 15:57:16.920] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:15,149)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:17,151)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:19,153)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Invalid escape char (bad2.c:7:26,160)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [info] bad2.c:7:34 TYPE=STR STR="\BU\P\T\ 北京 \ 邮电"
[2021-10-06 15:57:16.923] [info] bad2.c:7:35 TYPE=PAR_R TEXT=)
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:36,170)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:37,171)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [error] Lexical error: Unexpected char (bad2.c:7:38,172)
[2021-10-06 15:57:16.923] [error] 7 | printf ("\BU\P\T\ 北京 \ 邮电");
[2021-10-06 15:57:16.923] [error] ^ here
[2021-10-06 15:57:16.923] [info] bad2.c:8:10 TYPE=RETURN TEXT=return
[2021-10-06 15:57:16.923] [info] bad2.c:8:12 TYPE=NUM TEXT=0
[2021-10-06 15:57:16.923] [info] bad2.c:8:13 TYPE=SEMIC TEXT=;
[2021-10-06 15:57:16.923] [info] bad2.c:9:1 TYPE=CUR_R TEXT=}
[2021-10-06 15:57:16.923] [info] EOF
[2021-10-06 15:57:16.923] [info] Scan over
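Complex case (using the source code of Linux ls)
Input:
dist/sample2.c
Omitted.
Output:
Omitted.
Very complex case (btree.c, about 10,000 lines)
The output is about four million lines, too long to include here; see the attachment.
Conclusion
The program runs well: in the tests above it tokenizes valid input and reports multiple lexical errors in a single pass while continuing to scan.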