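In the previous article we showed, in principle, that every file in a Git repository can be recovered from a .git/ directory exposed by a website. In practice I only tested this against an ordinary repository, in a setup where the leaked directory itself returned 403 Forbidden when opened in a browser.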
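These files can be extracted by hand, but a larger Git repository contains a great many of them, and doing it manually is slow and error-prone. Since the method is known and the work is repetitive and regular, it can be automated in code. I have been hooked on Rust lately, so this tool is written in Rust as well. I had used the git2 library to operate on Git repositories directly in earlier tools, so I expected not to have to implement that part myself and development to be convenient. Once I started, though, I got stuck: git2 turned out not to fit this tool at all. I switched to the gitoxide library instead, which splits Git's low-level operations into small, separate crates and suits this job very well.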
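As the previous article showed, we first need the branch names from the repository's config file before we can look up each branch's object ID. The gix-config crate can parse that file; its parser emits a series of events, defined as follows: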
pub enum Event<'a> {
/// A comment with a comment tag and the comment itself. Note that the
/// comment itself may contain additional whitespace and comment markers
/// at the beginning, like `# comment` or `; comment`.
Comment(Comment<'a>),
/// A section header containing the section name and a subsection, if it
/// exists. For instance, `remote "origin"` is parsed to `remote` as section
/// name and `origin` as subsection name.
SectionHeader(section::Header<'a>),
/// A name to a value in a section, like `url` in `remote.origin.url`.
SectionValueName(section::ValueName<'a>),
/// A completed value. This may be any single-line string, including the empty string
/// if an implicit boolean value is used.
/// Note that these values may contain spaces and any special character. This value is
/// also unprocessed, so it may contain double quotes that should be
/// [normalized][crate::value::normalize()] before interpretation.
Value(Cow<'a, BStr>),
/// Represents any token used to signify a newline character. On Unix
/// platforms, this is typically just `\n`, but can be any valid newline
/// *sequence*. Multiple newlines (such as `\n\n`) will be merged as a single
/// newline event containing a string of multiple newline characters.
Newline(Cow<'a, BStr>),
/// Any value that isn't completed. This occurs when the value is continued
/// onto the next line by ending it with a backslash.
/// A [`Newline`][Self::Newline] event is guaranteed after, followed by
/// either a ValueDone, a Whitespace, or another ValueNotDone.
ValueNotDone(Cow<'a, BStr>),
/// The last line of a value which was continued onto another line.
/// With this it's possible to obtain the complete value by concatenating
/// the prior [`ValueNotDone`][Self::ValueNotDone] events.
ValueDone(Cow<'a, BStr>),
/// A continuous section of insignificant whitespace.
///
/// Note that values with internal whitespace will not be separated by this event,
/// hence interior whitespace there is always part of the value.
Whitespace(Cow<'a, BStr>),
/// This event is emitted when the parser counters a valid `=` character
/// separating the key and value.
/// This event is necessary as it eliminates the ambiguity for whitespace
/// events between a key and value event.
KeyValueSeparator,
}
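A branch name is the subsection name that follows the section name, as in [branch "main"], so of all these events only SectionHeader matters here.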
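To make the idea concrete, here is a minimal std-only sketch of pulling branch names out of section headers. It is a hand-rolled stand-in, not the gix-config API: the function name branch_refs is made up for illustration, and it ignores edge cases such as escaped quotes inside a subsection name.

```rust
/// Collect `refs/heads/<name>` paths from the `[branch "<name>"]` headers
/// of a Git config file. Illustrative only; it does not use gix-config.
fn branch_refs(config: &str) -> Vec<String> {
    let mut refs = Vec::new();
    for line in config.lines() {
        let line = line.trim();
        // A section header looks like `[branch "main"]`.
        let inner = match line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) {
            Some(inner) => inner,
            None => continue, // not a section header
        };
        // Headers without a subsection, e.g. `[core]`, carry no branch name.
        let (section, rest) = match inner.split_once(char::is_whitespace) {
            Some(parts) => parts,
            None => continue,
        };
        if !section.eq_ignore_ascii_case("branch") {
            continue;
        }
        let name = rest.trim().trim_matches('"');
        if !name.is_empty() {
            refs.push(format!("refs/heads/{}", name));
        }
    }
    refs
}
```

Each returned path can then be requested under .git/ to read the commit ID the branch points to.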
Next, the tool needs to parse commit objects. The gix-object crate parses the object files in a repository; a commit object, for example, is parsed into a CommitRef, defined as follows:
pub struct CommitRef<'a> {
/// HEX hash of tree object we point to. Usually 40 bytes long.
///
/// Use [`tree()`](CommitRef::tree()) to obtain a decoded version of it.
#[cfg_attr(feature = "serde", serde(borrow))]
pub tree: &'a BStr,
/// HEX hash of each parent commit. Empty for first commit in repository.
pub parents: SmallVec<[&'a BStr; 1]>,
/// Who wrote this commit. Name and email might contain whitespace and are not trimmed to ensure round-tripping.
///
/// Use the [`author()`](CommitRef::author()) method to received a trimmed version of it.
pub author: gix_actor::SignatureRef<'a>,
/// Who committed this commit. Name and email might contain whitespace and are not trimmed to ensure round-tripping.
///
/// Use the [`committer()`](CommitRef::committer()) method to received a trimmed version of it.
///
/// This may be different from the `author` in case the author couldn't write to the repository themselves and
/// is commonly encountered with contributed commits.
pub committer: gix_actor::SignatureRef<'a>,
/// The name of the message encoding, otherwise [UTF-8 should be assumed](https://github.com/git/git/blob/e67fbf927dfdf13d0b21dc6ea15dc3c7ef448ea0/commit.c#L1493:L1493).
pub encoding: Option<&'a BStr>,
/// The commit message documenting the change.
pub message: &'a BStr,
/// Extra header fields, in order of them being encountered, made accessible with the iterator returned by [`extra_headers()`](CommitRef::extra_headers()).
pub extra_headers: Vec<(&'a BStr, Cow<'a, BStr>)>,
}
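The fields useful to us here are tree and parents. The tree object ID lets us resolve the commit's directory structure and the files it contains; the parent IDs let us follow parent commits back through history and find the ID of every commit object reachable from the branch.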
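For readers who want to see what those two fields come from, the sketch below parses them by hand from a decompressed commit object. It assumes the zlib decompression has already happened elsewhere and relies on the loose-object layout `<kind> <size>\0<payload>`. The names split_object, commit_links and CommitLinks are hypothetical and are not part of gix-object.

```rust
/// Tree ID and parent IDs taken from a commit's header lines.
#[derive(Debug, PartialEq)]
struct CommitLinks {
    tree: String,
    parents: Vec<String>,
}

/// Split decompressed loose-object bytes into (kind, payload),
/// checking that the size in the header matches the payload length.
fn split_object(data: &[u8]) -> Option<(&str, &[u8])> {
    let nul = data.iter().position(|&b| b == 0)?;
    let header = std::str::from_utf8(&data[..nul]).ok()?;
    let (kind, size) = header.split_once(' ')?;
    let size: usize = size.parse().ok()?;
    let payload = &data[nul + 1..];
    if payload.len() == size {
        Some((kind, payload))
    } else {
        None
    }
}

/// Read the `tree` and `parent` header lines of a commit payload.
/// Headers end at the first blank line; the message after it is ignored.
fn commit_links(body: &str) -> Option<CommitLinks> {
    let mut tree = None;
    let mut parents = Vec::new();
    for line in body.lines() {
        if line.is_empty() {
            break;
        }
        if let Some(id) = line.strip_prefix("tree ") {
            tree = Some(id.to_string());
        } else if let Some(id) = line.strip_prefix("parent ") {
            parents.push(id.to_string());
        }
    }
    Some(CommitLinks { tree: tree?, parents })
}
```

A root commit simply yields an empty parents list, and a merge commit yields two or more entries.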
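A file object in Git, that is, a blob, is simply the file content compressed with zlib behind a short blob <size>\0 header, so handling a blob only requires decompressing the object, dropping the header, and writing the content to the right path. Tree objects again go through gix-object, which parses them into a TreeRef, defined as follows: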
pub struct TreeRef<'a> {
/// The directories and files contained in this tree.
///
/// Beware that the sort order isn't *quite* by name, so one may bisect only with a [`tree::EntryRef`] to handle ordering correctly.
#[cfg_attr(feature = "serde", serde(borrow))]
pub entries: Vec<tree::EntryRef<'a>>,
}
pub struct EntryRef<'a> {
/// The kind of object to which `oid` is pointing.
pub mode: tree::EntryMode,
/// The name of the file in the parent tree.
pub filename: &'a BStr,
/// The id of the object representing the entry.
// TODO: figure out how these should be called. id or oid? It's inconsistent around the codebase.
// Answer: make it 'id', as in `git2`
#[cfg_attr(feature = "serde", serde(borrow))]
pub oid: &'a gix_hash::oid,
}
The mode field indicates whether the entry is a directory or a file, and oid is the ID of the object the entry points to.
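As an illustration of what TreeRef decodes, here is a minimal hand-written parser for the raw tree payload. It assumes a SHA-1 repository (20-byte IDs) and a payload that has already been decompressed and stripped of its `tree <size>\0` header. RawEntry and parse_tree are hypothetical names, not gix-object types.

```rust
/// One entry of a raw tree object.
#[derive(Debug, PartialEq)]
struct RawEntry {
    is_dir: bool,
    name: String,
    id_hex: String,
}

/// Parse a tree payload: repeated `<octal mode> <name>\0<20 raw ID bytes>`.
/// Returns None if the data is truncated or malformed.
fn parse_tree(mut data: &[u8]) -> Option<Vec<RawEntry>> {
    let mut entries = Vec::new();
    while !data.is_empty() {
        let space = data.iter().position(|&b| b == b' ')?;
        let nul = data.iter().position(|&b| b == 0)?;
        if nul < space || data.len() < nul + 21 {
            return None;
        }
        let mode = &data[..space];
        let name = String::from_utf8_lossy(&data[space + 1..nul]).into_owned();
        let id = &data[nul + 1..nul + 21];
        entries.push(RawEntry {
            // Mode 40000 marks a subtree (directory); every other mode
            // (regular file, executable, symlink, submodule) is treated
            // as a non-directory here.
            is_dir: mode == b"40000",
            name,
            id_hex: id.iter().map(|b| format!("{:02x}", b)).collect(),
        });
        data = &data[nul + 21..];
    }
    Some(entries)
}
```

Entries with is_dir set would be parsed again as trees, while the rest would be fetched as blobs.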
Finally, don't forget the index file; parsing it requires the gix-index crate.
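The tool leaves this work to gix-index. Purely to show what sits at the start of the file, the sketch below reads the fixed 12-byte header by hand: the DIRC signature, the version, and the entry count, each stored big-endian. The function name index_header is made up for illustration, and the entries themselves are not parsed.

```rust
/// Read the 12-byte header of `.git/index`.
/// Returns (version, number of entries), or None if the signature is wrong
/// or the data is too short.
fn index_header(data: &[u8]) -> Option<(u32, u32)> {
    if data.len() < 12 || &data[..4] != b"DIRC" {
        return None;
    }
    let version = u32::from_be_bytes([data[4], data[5], data[6], data[7]]);
    let entries = u32::from_be_bytes([data[8], data[9], data[10], data[11]]);
    Some((version, entries))
}
```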
That covers the basic building blocks. The main code logic is as follows:
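// Collect the names of all branches in the repository
fn get_branches() {
    read the raw bytes of the config file
    parse them with gix_config::parse::from_bytes
    take the branch names from the SectionHeader events
    format each branch name as a 'refs/heads/<branch>' path
    return the formatted paths as an array of strings
}
// Parse a commit object
fn dump_commit(commit_sha1) {
    read the raw bytes of the commit object
    parse them with CommitRef::from_bytes
    process the resulting CommitRef
        collect the parent commit IDs
        take the tree object ID
    return the parent commit IDs and the tree object ID
}
// Parse a tree object
fn dump_tree(tree_sha1) {
    read the raw bytes of the tree object
    parse them with TreeRef::from_bytes
    process the resulting TreeRef, collecting subtree IDs and blob IDs
    return the subtree IDs and the blob IDs
}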
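Because each of these functions handles a single object, the caller has to drive the traversal. One possible driver for the commit side is sketched below, assuming a callback shaped like the dump_commit pseudocode above. The name walk_commits and the callback signature are assumptions for illustration, not the author's actual code.

```rust
use std::collections::{HashSet, VecDeque};

/// Walk the commit graph breadth-first from the branch tips, visiting each
/// commit once. `dump_commit` is a hypothetical callback that returns
/// (parent IDs, tree ID) for a commit ID, or None if it cannot be fetched.
/// Returns the commits in visiting order and the set of tree IDs found.
fn walk_commits<F>(tips: &[&str], mut dump_commit: F) -> (Vec<String>, HashSet<String>)
where
    F: FnMut(&str) -> Option<(Vec<String>, String)>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = tips.iter().map(|t| t.to_string()).collect();
    let mut commits = Vec::new();
    let mut trees = HashSet::new();
    while let Some(id) = queue.pop_front() {
        // Merges make the same ancestor reachable by several paths.
        if !seen.insert(id.clone()) {
            continue;
        }
        let (parents, tree) = match dump_commit(id.as_str()) {
            Some(links) => links,
            None => continue, // missing or unreadable object: skip it
        };
        commits.push(id);
        trees.insert(tree);
        queue.extend(parents);
    }
    (commits, trees)
}
```

The seen set keeps the walk from downloading shared ancestors more than once; the collected tree IDs would then be handed to a dump_tree-style function in the same fashion.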
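The full implementation is in this source file. The code deliberately handles only one level of objects per call and returns detailed, formatted information, so that callers can define their own traversal. I put my own traversal in this function, for reference only.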
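The crates that gitoxide provides are fairly low-level, and at first I could not work out from the documentation how to build what I needed. Asking an AI did not give satisfying results either. Later, while browsing the repository, I found that it contains a large number of test cases, and it was through those tests that I finally figured out how to implement the functionality I wanted.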