Building a .git Directory Leak Recovery Tool in Rust

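In the previous article we showed, in principle, that the files of an entire Git repository can be recovered from a .git/ directory exposed by a website. In practice I only tested this on an ordinary repository, and the leaked directory itself returned 403 Forbidden when opened in a browser.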

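These files can be extracted by hand, but a larger Git repository contains a great many of them, so manual work is slow and error-prone. Since the method is known and the steps are repetitive and regular, the download can be automated in code. I have been hooked on Rust lately, so this tool is written in Rust as well. In an earlier tool I used the git2 library to operate on Git repositories directly, which saved me from implementing that part myself and made development convenient. This time, however, I got stuck: git2 turned out to be unsuitable for this tool. I therefore chose the gitoxide library, which splits Git's low-level operations into small individual crates and is a very good fit for this project.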

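As the previous article showed, we first need the branch names from the repository's config file before we can obtain each branch's object ID. The gix-config crate can parse this file; its parser emits a series of events, defined as follows: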

pub enum Event<'a> {
    /// A comment with a comment tag and the comment itself. Note that the
    /// comment itself may contain additional whitespace and comment markers
    /// at the beginning, like `# comment` or `; comment`.
    Comment(Comment<'a>),
    /// A section header containing the section name and a subsection, if it
    /// exists. For instance, `remote "origin"` is parsed to `remote` as section
    /// name and `origin` as subsection name.
    SectionHeader(section::Header<'a>),
    /// A name to a value in a section, like `url` in `remote.origin.url`.
    SectionValueName(section::ValueName<'a>),
    /// A completed value. This may be any single-line string, including the empty string
    /// if an implicit boolean value is used.
    /// Note that these values may contain spaces and any special character. This value is
    /// also unprocessed, so it may contain double quotes that should be
    /// [normalized][crate::value::normalize()] before interpretation.
    Value(Cow<'a, BStr>),
    /// Represents any token used to signify a newline character. On Unix
    /// platforms, this is typically just `\n`, but can be any valid newline
    /// *sequence*. Multiple newlines (such as `\n\n`) will be merged as a single
    /// newline event containing a string of multiple newline characters.
    Newline(Cow<'a, BStr>),
    /// Any value that isn't completed. This occurs when the value is continued
    /// onto the next line by ending it with a backslash.
    /// A [`Newline`][Self::Newline] event is guaranteed after, followed by
    /// either a ValueDone, a Whitespace, or another ValueNotDone.
    ValueNotDone(Cow<'a, BStr>),
    /// The last line of a value which was continued onto another line.
    /// With this it's possible to obtain the complete value by concatenating
    /// the prior [`ValueNotDone`][Self::ValueNotDone] events.
    ValueDone(Cow<'a, BStr>),
    /// A continuous section of insignificant whitespace.
    ///
    /// Note that values with internal whitespace will not be separated by this event,
    /// hence interior whitespace there is always part of the value.
    Whitespace(Cow<'a, BStr>),
    /// This event is emitted when the parser encounters a valid `=` character
    /// separating the key and value.
    /// This event is necessary as it eliminates the ambiguity for whitespace
    /// events between a key and value event.
    KeyValueSeparator,
}

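Because a branch name is the subsection name that follows the section name (main in [branch "main"], for example), SectionHeader is the only one of these events we need to care about.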
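To make the idea concrete, here is a minimal sketch that uses only the standard library. It does not use gix-config: it scans the config text by hand for [branch "<name>"] headers, which is the information the SectionHeader event carries. The function name branch_refs is hypothetical, and escape sequences inside subsection names are ignored.

```rust
/// Collect `refs/heads/<name>` for every `[branch "<name>"]` header in a
/// Git config file. Hand-rolled stand-in for the SectionHeader events;
/// escapes inside the subsection name are not handled.
fn branch_refs(config: &str) -> Vec<String> {
    let mut refs = Vec::new();
    for line in config.lines() {
        let line = line.trim();
        // A section header looks like `[section "subsection"]`.
        let Some(inner) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) else {
            continue;
        };
        let Some((section, rest)) = inner.split_once(char::is_whitespace) else {
            continue; // header without a subsection, e.g. `[core]`
        };
        if !section.eq_ignore_ascii_case("branch") {
            continue;
        }
        let name = rest.trim().trim_matches('"');
        if !name.is_empty() {
            refs.push(format!("refs/heads/{name}"));
        }
    }
    refs
}
```

Each returned string is a path relative to the .git/ directory, so it can be appended to the leaked URL to fetch the file holding that branch's commit ID.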

Next, the tool needs to parse commit objects. The gix-object crate can parse the object files in a repository; a commit object, for example, is parsed into a CommitRef structure, defined as follows:

pub struct CommitRef<'a> {
    /// HEX hash of tree object we point to. Usually 40 bytes long.
    ///
    /// Use [`tree()`](CommitRef::tree()) to obtain a decoded version of it.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub tree: &'a BStr,
    /// HEX hash of each parent commit. Empty for first commit in repository.
    pub parents: SmallVec<[&'a BStr; 1]>,
    /// Who wrote this commit. Name and email might contain whitespace and are not trimmed to ensure round-tripping.
    ///
    /// Use the [`author()`](CommitRef::author()) method to receive a trimmed version of it.
    pub author: gix_actor::SignatureRef<'a>,
    /// Who committed this commit. Name and email might contain whitespace and are not trimmed to ensure round-tripping.
    ///
    /// Use the [`committer()`](CommitRef::committer()) method to receive a trimmed version of it.
    ///
    /// This may be different from the `author` in case the author couldn't write to the repository themselves and
    /// is commonly encountered with contributed commits.
    pub committer: gix_actor::SignatureRef<'a>,
    /// The name of the message encoding, otherwise [UTF-8 should be assumed](https://github.com/git/git/blob/e67fbf927dfdf13d0b21dc6ea15dc3c7ef448ea0/commit.c#L1493:L1493).
    pub encoding: Option<&'a BStr>,
    /// The commit message documenting the change.
    pub message: &'a BStr,
    /// Extra header fields, in order of them being encountered, made accessible with the iterator returned by [`extra_headers()`](CommitRef::extra_headers()).
    pub extra_headers: Vec<(&'a BStr, Cow<'a, BStr>)>,
}

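The fields that matter to us here are tree and parents. With the commit's tree object ID we can resolve the directory structure of that commit and the files it contains; with the commit's parents we can follow the parent commit objects and find every commit object ID reachable from the branch heads.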
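For illustration, the two fields can also be pulled out of a commit by hand. The sketch below assumes the object has already been decompressed and its commit <size> header removed, so that only the textual payload is left; it is a standard-library stand-in, not how gix-object does it, and the name tree_and_parents is hypothetical.

```rust
/// Extract the tree ID and parent IDs from the payload of a commit object
/// (the text after the `commit <size>\0` header, already decompressed).
/// Returns None if no `tree` header line is present.
fn tree_and_parents(payload: &str) -> Option<(String, Vec<String>)> {
    let mut tree = None;
    let mut parents = Vec::new();
    for line in payload.lines() {
        if line.is_empty() {
            break; // a blank line separates the headers from the message
        }
        if let Some(id) = line.strip_prefix("tree ") {
            tree = Some(id.to_string());
        } else if let Some(id) = line.strip_prefix("parent ") {
            parents.push(id.to_string());
        }
    }
    tree.map(|t| (t, parents))
}
```

A root commit yields an empty parents list and a merge commit yields two or more entries, which matches the parents field of CommitRef shown above.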

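A file object in Git, that is, a blob object, is simply the file compressed with zlib (with a short blob <size> header and a NUL byte in front of the content), so handling a blob only requires decompressing the object into the correct directory. Tree objects, on the other hand, also need gix-object, which parses them into a TreeRef structure, defined as follows: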

pub struct TreeRef<'a> {
    /// The directories and files contained in this tree.
    ///
    /// Beware that the sort order isn't *quite* by name, so one may bisect only with a [`tree::EntryRef`] to handle ordering correctly.
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub entries: Vec<tree::EntryRef<'a>>,
}

pub struct EntryRef<'a> {
    /// The kind of object to which `oid` is pointing.
    pub mode: tree::EntryMode,
    /// The name of the file in the parent tree.
    pub filename: &'a BStr,
    /// The id of the object representing the entry.
    // TODO: figure out how these should be called. id or oid? It's inconsistent around the codebase.
    //       Answer: make it 'id', as in `git2`
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub oid: &'a gix_hash::oid,
}

Here the mode field indicates whether the entry is a directory or a file, and oid is the object's ID.
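As a rough picture of what sits underneath these structures, the sketch below parses the raw bytes by hand with the standard library only. It assumes the object has already been inflated with zlib and that the repository uses SHA-1, so every ID is 20 raw bytes. The names split_loose_object, RawEntry and parse_tree are hypothetical and are not part of gix-object.

```rust
/// Split a decompressed loose object into its kind and content.
/// The layout is `<kind> <size>\0<content>`; returns None if malformed.
fn split_loose_object(data: &[u8]) -> Option<(&str, &[u8])> {
    let nul = data.iter().position(|&b| b == 0)?;
    let header = std::str::from_utf8(&data[..nul]).ok()?;
    let (kind, size) = header.split_once(' ')?;
    let size: usize = size.parse().ok()?;
    let content = &data[nul + 1..];
    // The declared size must match the bytes that follow the header.
    (content.len() == size).then_some((kind, content))
}

/// One entry of a tree object: mode, file name and the object ID as hex.
#[derive(Debug, PartialEq)]
struct RawEntry {
    mode: String,
    name: String,
    oid_hex: String,
}

impl RawEntry {
    /// Git stores directories (subtrees) with mode `40000`.
    fn is_dir(&self) -> bool {
        self.mode == "40000"
    }
}

/// Parse a tree payload: repeated `<mode> <name>\0<20 raw ID bytes>`.
fn parse_tree(mut data: &[u8]) -> Option<Vec<RawEntry>> {
    let mut entries = Vec::new();
    while !data.is_empty() {
        let space = data.iter().position(|&b| b == b' ')?;
        let nul = data.iter().position(|&b| b == 0)?;
        if nul < space || data.len() < nul + 1 + 20 {
            return None; // truncated or malformed entry
        }
        let mode = String::from_utf8(data[..space].to_vec()).ok()?;
        let name = String::from_utf8_lossy(&data[space + 1..nul]).into_owned();
        let oid_hex: String = data[nul + 1..nul + 21]
            .iter()
            .map(|b| format!("{b:02x}"))
            .collect();
        entries.push(RawEntry { mode, name, oid_hex });
        data = &data[nul + 21..];
    }
    Some(entries)
}
```

Entries for which is_dir() returns true point at further tree objects to descend into; the others point at blobs whose content is written to disk under the entry's name.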

Finally, don't forget the index file; parsing it requires the gix-index crate.
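Before handing a downloaded index file to a full parser, it can be useful to check that the download really is an index. The sketch below reads only the fixed 12-byte header (the DIRC signature, then the version and the entry count as big-endian 32-bit integers) using the standard library. The function name parse_index_header is hypothetical, and the entries themselves are left to gix-index.

```rust
/// Read the header of `.git/index`: the 4-byte signature `DIRC`, followed by
/// the format version and the number of entries, both big-endian u32 values.
/// Returns (version, entry_count), or None if the data is not an index file.
fn parse_index_header(data: &[u8]) -> Option<(u32, u32)> {
    if data.len() < 12 || &data[..4] != b"DIRC" {
        return None;
    }
    let version = u32::from_be_bytes([data[4], data[5], data[6], data[7]]);
    let entries = u32::from_be_bytes([data[8], data[9], data[10], data[11]]);
    Some((version, entries))
}
```

A None result usually means the server returned an error page instead of the file, which is worth detecting before any further parsing.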

With the basic functionality analyzed, here is the main code logic:

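// Get the names of all branches in the repository
fn get_branches(){
    read the config file as a byte stream
    parse the byte stream with gix_config::parse::from_bytes
    format each branch name as a 'refs/heads/<branch>' path
    return the array of formatted strings
}
// Parse a commit object
fn dump_commit(commit_sha1){
    read the commit object as a byte stream
    parse the byte stream with CommitRef::from_bytes
    process the parsed CommitRef structure
    get the array of parent commit object IDs
    get the tree object ID
    return the parent commit object IDs and the tree object ID
}
// Parse a tree object
fn dump_tree(tree_sha1){
    read the tree object as a byte stream
    parse the byte stream with TreeRef::from_bytes
    process the parsed TreeRef structure to get the tree object IDs and blob object IDs
    return the tree object IDs and the blob object IDs
}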
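One possible way to drive these functions is a breadth-first walk over the commit graph, starting from the commit ID of each branch. The sketch below is only an assumption about how a caller might do it, not the flow used in the actual tool. The fetch callback stands in for dump_commit (download, decompress, parse), and the name walk_commits is hypothetical.

```rust
use std::collections::{HashSet, VecDeque};

/// Walk the commit graph breadth-first, starting from the branch heads.
/// `fetch` is a hypothetical callback that downloads and parses one commit,
/// returning its parent IDs and tree ID, or None if the object is missing.
fn walk_commits<F>(heads: &[String], mut fetch: F) -> Vec<(String, String)>
where
    F: FnMut(&str) -> Option<(Vec<String>, String)>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = heads.iter().cloned().collect();
    let mut found = Vec::new(); // (commit ID, tree ID) pairs in visit order
    while let Some(id) = queue.pop_front() {
        if !seen.insert(id.clone()) {
            continue; // already visited via another branch or a merge
        }
        if let Some((parents, tree)) = fetch(&id) {
            queue.extend(parents);
            found.push((id, tree));
        }
    }
    found
}
```

The seen set matters because branches usually share history, so without it the same commits would be downloaded once per branch. Each returned tree ID can then be passed to dump_tree.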

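The full implementation can be found in this source file. The code only handles one level of objects at a time and returns detailed, formatted information, so that callers can define their own parsing flow. I wrote my own flow in this function, for reference only.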

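The sub-crates that gitoxide provides are fairly low-level, and at first I could not work out from the documentation how to implement what I needed. Asking an AI did not give satisfactory results either. Later, while browsing the code repository, I found that it contains a large number of test cases, and it was through those tests that I finally worked out how to implement what I wanted.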
