2009年12月15日星期二

Virtual box shared folders 備忘

Virtual box 3.1 已支援 OpenGl 2.1,MCore 的 Linux port 也可繼續。

已往我使用 smbfs 來 mount Windows 的網絡磁碟機:

//192.168.0.1/mcore3d /home/ricky/mcore3d smbfs iocharset=utf8,unicode,username=xxx,password=xxx 0 0

不過很容易會有權限或 encoding 的問題,而某個 library 更新之後又會有機會使原有設定失效。

比較之下,Virtual box 提供的 shared folders 就可靠得多:

mcore3d /home/ricky/mcore3d vboxsf defaults 0 0

2009年12月2日星期三

為灰鼠捉蟲

使用了 Squirrel 已好一些日子,終於給我遇到了臭蟲。這臭蟲的表徵是不確定的行為,debug/release/,run-with/without debugger 都會產生不同的結果。這一類不確定的行為不是多緒就是錯誤記憶體存取做成,而面對著陌生的 code base,最好的捉蟲方法就是使用專門的工具,如 Intel Parallel Studio 提供的 Parallel Inspector

最終找到錯誤源於一處被 reference 的記憶體給 realloc 了;餘下的就交給 Squirrel 作者處理

Invalid memory access

2009年10月21日星期三

Squirrel remote debugger

以下是完成了大約三分之一的 Squirrel 遠程除錯器。當中的製作過程比想像中簡單,皆因 Squirrel 已有一個除錯伺服器的實作 sqdbg,而我需要做的只是客戶端與 Gui。

Sqdbg 的編程界面僅有四個函數,使用 Xml 來進行客戶端溝通。如今加減中斷點,下一行,跳出跳入和暫停繼續等等已經完成,接下來還有堆疊,變量監視等工夫。

除了結合自家的遊戲引擎,它也會獨立出來貢獻給 Squirrel 社群。

SqdbgClient

2009年9月17日星期四

In-Game C++ Memory Profiler

忙碌了兩個多月,這篇有可出現在 Gems 系列的 "In-Game Memory Profiler" 終於完成。製作這類 Profiler 已經不知多少次了,但這次是最滿意的;可以媲美需付款的產品 (其實類似的 C++ memory profiler 產品真的少之又少),甚至更好。就是它,讓我知道 Bullet 每次 step simulation 都有好幾千個即用即棄的 memory allocation/de-allocation;後來我把它的 default pool size 增大,問題也就解決了。

In-Game-Memory-Profiler

2009年9月5日星期六

弱指針的重要性

雖然有許多有關智能指針 (Smart pointer) 的文章和範例,但關於弱指針的就相對寥寥無幾。縱然有所提及,也只是草草幾筆說弱指針是用來避免 "循環引用 (Cyclic reference)" 云云。就此問題我嘗試用一些例子去補充說明弱指針何時用,為什麼用。

首先讓我們看一個遊戲引擎中典型的資料管理系統例子:

因為要節省內存的需求,Texture (紋理) 被設計成能夠給多個 Model (三維模型) 共享。而 Texture 的存在目的,就是留給 Model 作為繪畫時使用;明顯地,Model 可用 Smart Pointer 來引用 Texture;相對地 NameToTextureMap (方便用名稱來作資料搜尋) 該使用甚麼來引用 Texture 就沒有那麼明顯了。先看看使用不同的指針會有什麼的後果吧。
  • 裸指針 - 這將會是災難性的選擇,因為一旦再沒有 Model 引用 Texture, NameToTextureMap 就會在毫無知會的情況下指向一件非法的物件。

  • 智能指針 - 為了解決裸指針的問題,選擇智能指針像是合理的做法。不過請大家想一想,一旦 NameToTextureMap 引用了 Texture 以後,何時才可刪除引用呢?假如不去刪除引用,Texture 的生命期就會與 NameToTextureMap 同步化,失去了原先使用智能指針作為純粹被 Model 共享的意義。
這時就是讓弱指針發揮的時候。弱指針的特性就是當它指向的物件被銷毀之時,弱指針的值就會瞬間變為零;因此,讓 NameToTextureMap 使用弱指針,再定期把零值的弱指針由容器中刪除,以上的 UML 設計就可忠實地實作出來。

總結來說,智能指針旨不在於提供程序員不加思索就可使用的工具,它是一種能幫助你正確地實行擁有權共享的工具。就算是聲稱擁有自動內存管理的語言如 Java, C#, Python, Ruby 等等都一定會有弱指針這個概念,否則內存洩漏是很難避免的。假若你在開發 C++ 程序的時候遇到許多關於物件擁有權的問題,請記得弱指針的存在。

2009年9月3日星期四

Syntax highlight in Notepad++

從互聯網找到個給 Squirrel 語言著色的 Notepad++ config 檔案,也可作為日後 MCore3D Studio 結合 Scintilla Control 的藍圖。把以下的 Xml 儲存為 "userDefineLang.xml" 再放入 Notepad++ 的安置目錄就行了。

<NotepadPlus>
<UserLang name="Squirrel" ext="nut">
<Settings>
<Global caseIgnored="no" />
<TreatAsSymbol comment="yes" commentLine="yes" />
<Prefix words1="no" words2="no" words3="no" words4="no" />
</Settings>
<KeywordLists>
<Keywords name="Delimiters">&quot;00&quot;00</Keywords>
<Keywords name="Folder+"></Keywords>
<Keywords name="Folder-"></Keywords>
<Keywords name="Operators">- ! ( ) , . : ; ? [ ] { } + &lt; = &gt;</Keywords>
<Keywords name="Comment">1/* 2*/ 0//</Keywords>
<Keywords name="Words1">break case catch class clone continue const default delegate delete else enum extends for function if in null resume return switch this throw try typeof parent yield constructor instanceof true false static do while foreach</Keywords>
<Keywords name="Words2">local</Keywords>
<Keywords name="Words3">ARGS</Keywords>
<Keywords name="Words4"></Keywords>
</KeywordLists>
<Styles>
<WordsStyle name="DEFAULT" styleID="11" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="FOLDEROPEN" styleID="12" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="FOLDERCLOSE" styleID="13" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="KEYWORD1" styleID="5" fgColor="0000FF" bgColor="FFFFFF" fontName="" fontStyle="1" />
<WordsStyle name="KEYWORD2" styleID="6" fgColor="006262" bgColor="FFFFFF" fontName="" fontStyle="1" />
<WordsStyle name="KEYWORD3" styleID="7" fgColor="007777" bgColor="FFFFFF" fontName="" fontStyle="3" />
<WordsStyle name="KEYWORD4" styleID="8" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="COMMENT" styleID="1" fgColor="008000" bgColor="FFFFFF" fontName="" fontStyle="2" />
<WordsStyle name="COMMENT LINE" styleID="2" fgColor="008000" bgColor="FFFFFF" fontName="" fontStyle="2" />
<WordsStyle name="NUMBER" styleID="4" fgColor="FF8000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="OPERATOR" styleID="10" fgColor="000080" bgColor="FFFFFF" fontName="" fontStyle="1" />
<WordsStyle name="DELIMINER1" styleID="14" fgColor="7E7E7E" bgColor="FFFFFF" fontName="" fontStyle="2" />
<WordsStyle name="DELIMINER2" styleID="15" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
<WordsStyle name="DELIMINER3" styleID="16" fgColor="000000" bgColor="FFFFFF" fontName="" fontStyle="0" />
</Styles>
</UserLang>
</NotepadPlus>

2009年8月29日星期六

Mesa 3D

多年沒有注視的 Mesa 3D 原來已經支援 OpenGL 2.1,亦表示它包含了 Glsl 的編輯器與運行期軟件。但更令我感到驚喜的是它的驅動層是可以使用 Direct3D9 作為實作。這個實作應該是基於 GlDirect 為藍圖,把一個個 OpenGL 指令變換成 Direct3D 的函數。
理想地,我不必再為不同平台間的 Graphics API 而煩惱;因為 OpenGL 本身就是設計成可擴充的,而 Mesa3D 又提供了一個很好的起步點。我的引擎只要沿用 OpenGL,再對 Mesa3D 的驅動層稍為修改就應該可在 XBox 上跑。
這兩天會花點時間對這想法的實際可行性多作研究。

imageimage

2009年8月5日星期三

Restrict 指針

一般時候我們的編譯器都不知道兩個指針所指的位置是否相同或者有所重疊,大大降低了它的優化效果。就拿 Matrix3 * Vector3 為例:

void Matrix::MulVector(const Vec3& v, Vec3& result) {
result.x = m00 * v.x + m01 * v.y + m02 * v.z;
result.y = m10 * v.x + m11 * v.y + m12 * v.z;
result.z = m20 * v.x + m21 * v.y + m22 * v.z;
}

很簡單的三行代碼,然而它隱藏了一個效能上的問題。堂 result 被賦予了新的值後,編譯器認為 v 的值也可能被更改了 (v 和 result 也是 reference,pointer 的一類)。因此原本可以留在 register 裡的 v.x, v.y 和 v.z 被迫從新由記憶體裡閱讀回來。看看VC2008 編譯成的機器碼吧:

result.x = m00 * v.x + m01 * v.y + m02 * v.z;
mov eax,dword ptr [esp+4]
movss xmm0,dword ptr [ecx+8]
mulss xmm0,dword ptr [eax+8]
movss xmm1,dword ptr [ecx+4]
mulss xmm1,dword ptr [eax+4]
mov edx,dword ptr [esp+8]
addss xmm0,xmm1
movss xmm1,dword ptr [eax]
mulss xmm1,dword ptr [ecx]
addss xmm0,xmm1
movss dword ptr [edx],xmm0

result.y = m10 * v.x + m11 * v.y + m12 * v.z;
movss xmm0,dword ptr [ecx+0Ch]
mulss xmm0,dword ptr [eax]
movss xmm1,dword ptr [ecx+14h]
mulss xmm1,dword ptr [eax+8]
addss xmm0,xmm1
movss xmm1,dword ptr [ecx+10h]
mulss xmm1,dword ptr [eax+4]
addss xmm0,xmm1
movss dword ptr [edx+4],xmm0

result.z = m20 * v.x + m21 * v.y + m22 * v.z;
movss xmm0,dword ptr [ecx+18h]
mulss xmm0,dword ptr [eax]
movss xmm1,dword ptr [ecx+20h]
mulss xmm1,dword ptr [eax+8]
addss xmm0,xmm1
movss xmm1,dword ptr [ecx+1Ch]
mulss xmm1,dword ptr [eax+4]
addss xmm0,xmm1
movss dword ptr [edx+8],xmm0

這時候使用 restrict 就可幫上編譯器把。請注意,VC2008 的 __restrict 只對指針生效:

void Matrix::MulVector(const Vec3& v_, Vec3& result_) {
const Vec3* __restrict v = &v_;
Vec3* __restrict ret = &result_;
result->x = m00 * v->x + m01 * v->y + m02 * v->z;
result->y = m10 * v->x + m11 * v->y + m12 * v->z;
result->z = m20 * v->x + m21 * v->y + m22 * v->z;
}

從新編譯後的機器碼:

result->x = m00 * v->x + m01 * v->y + m02 * v->z;
mov eax,dword ptr [esp+4]
movss xmm1,dword ptr [eax+4]
movss xmm0,dword ptr [eax+8]
movss xmm2,dword ptr [eax]
movss xmm3,dword ptr [ecx+4]
movss xmm4,dword ptr [ecx+8]
mov eax,dword ptr [esp+8]
mulss xmm3,xmm1
mulss xmm4,xmm0
addss xmm3,xmm4
movaps xmm4,xmm2
mulss xmm4,dword ptr [ecx]
addss xmm3,xmm4

result->y = m10 * v->x + m11 * v->y + m12 * v->z;
movss xmm4,dword ptr [ecx+10h]
movss dword ptr [eax],xmm3
movss xmm3,dword ptr [ecx+0Ch]
mulss xmm3,xmm2
mulss xmm4,xmm1
addss xmm3,xmm4
movss xmm4,dword ptr [ecx+14h]
mulss xmm4,xmm0
addss xmm3,xmm4
movss dword ptr [eax+4],xmm3

result->z = m20 * v->x + m21 * v->y + m22 * v->z;
movss xmm3,dword ptr [ecx+18h]
mulss xmm3,xmm2
movss xmm2,dword ptr [ecx+1Ch]
mulss xmm2,xmm1
movss xmm1,dword ptr [ecx+20h]
addss xmm3,xmm2
mulss xmm1,xmm0
addss xmm3,xmm1
movss dword ptr [eax+8],xmm3

可以看到記憶體閱讀操作減少了。

到最後,其實使用局部變量也可達到類似效果:

void Matrix::MulVector(const Vec3& v, Vec3& result) {
const float x = v.x;
const float y = v.y;
const float z = v.z;

result.x = m00 * x + m01 * y + m02 * z;
result.y = m10 * x + m11 * y + m12 * z;
result.z = m20 * x + m21 * y + m22 * z;
}

2009年6月21日星期日

引擎開發進度

有了新成員的加盟,遊戲引擎的開發進程加快,以下是一些測試程式的截圖。
  • 漂亮又便宜的 God Ray


  • Cube map + Normal mapping


  • 使用 Bullet 製作的 Physics component


  • 場景編輯器

2009年6月17日星期三

於特定線程設置中斷點

正在開發一個 memory profiler,為了針對多線程部分的除錯,我須要設置一個只對特定線程生效的中斷點。在 Visual Studio 中,有兩個非常類似的方法可達到目的:
  1. 右擊您的 break point 然後使用 break point filter

    選擇線程 id

  2. 另一個方法是使用 break point condition


2009年6月6日星期六

不要被 Windows.h 強暴

引擎的物理部份開始開發,我滿心喜悅地把 Bullet Pyhsics Engine 加入專案,但"咚"的一聲告訴我編譯器掛丟了。誰在作怪呢?當編譯器指向那一向相安無事的數學 max 函數,我就知道是 Bullet include 了 Windows.h。錯就錯在 Bullet 在 btQuickprof.h 哪裡包括了 Windows.h,迫使其他檔案也一併把 Windows.h include 過來,本來潔淨的命名空間就這樣被成千上萬的 macros 如 min, max, CreateWindow, DrawText 等等污染了。
其實 Windows.h 是可以避免在標頭檔中出現的:
  • 盡可能把函數定義由標頭檔移到源碼檔去
  • 大多數基本型別如 DWORD, LARGE_INTEGER 等在標頭檔中可用 unsigned long, __int64 來取代,再在源碼檔用 reinterpret_cast 返回原本的型別。大多數的 struct 指針都可以前置聲明(forward declarate) 如 typedef struct _GUID GUID;
  • 更甚者還會使用 char mMutex[24]; 來取代 CRITICAL_SECTION mMutex; 再加上編譯期 assert 確保兩者內存大小匹配:STATIC_ASSERT(sizeof(mMutex) == sizeof(CRITICAL_SECTION));
把 Windows.h 由標頭檔中抽離以後,不單止命名空間的問題解決了,連編譯的速度都快了哩!

2009年5月30日星期六

Squirrel 單元測試框架

暫時還無法找到一個給予 Squirrel 使用的單元測試框架,唯有自己做吧。Lualuaunit 是一個好的起點,不需一天的時間就把它移到 Squirrel,給它命名為 squnit (下載 / Download)。以下是一些使用範例:

dofile("squnit.nut", true);

class TestToto
{
a = null;
s = null;

function setUp()
{
// Set up tests
a = 1;
s = "hop";
}

function test1_withFailure()
{
print("some stuff test 1\n");
assertEquals(a , 1);
// Will fail
assertEquals(a , 2);
assertEquals(a , 2);
}

function test2_withFailure()
{
print("some stuff test 2\n");
assertEquals(a , 1);
assertEquals(s , "hop");
// Will fail
assertEquals(s , "bof");
assertEquals(s , "bof");
}

function test3()
{
print("some stuff test 3\n");
assertEquals(a , 1);
assertEquals(s , "hop");
assertEquals(typeof a, "integer");
assertClose(0.01, -0.01, 0.02);
}
} // TestToto

class TestTiti
{
a = null;
s = null;

function setUp()
{
a = 1;
s = "hop";
print("TestTiti.setUp\n");
}

function tearDown()
{
// Some tearDown() code if necessary
print("tearDown\n");
}

function test1_withFailure()
{
print("some stuff test 1\n");
assertEquals(a , 1);
// Will fail
assertEquals(a , 2);
assertEquals(a , 2);
}

function test2_withFailure()
{
print("some stuff test 2\n");
assertEquals(a , 1);
assertEquals(s , "hop");
// Will fail
assertEquals(s , "bof");
}

function test3()
{
print("some stuff test 3\n");
assertEquals(a , 1);
assertEquals(s , "hop");
}
} // TestTiti

// Simple test functions that were written previously can be integrated in luaunit too
function test1_withFailure()
{
assert(1 == 1);
// Will fail
assert(1 == 2);
}

function test2_withFailure()
{
assert("a" == "a");
// Will fail
assert("a" == "b");
}

function test3()
{
assert(1 == 1);
assert("a" == "a");
}

TestFunctions <- wrapFunctions("test1_withFailure", "test2_withFailure", "test3");

// SqUnit().run("test2_withFailure"); // Run only one test function
// SqUnit().run("test1_withFailure");
// SqUnit().run("TestToto"); // Run only on test class
// SqUnit().run("TestTiti:test3"); // Run only one test method of a test class
SqUnit().run(); // Run all tests

還可以使用 squnit 本身來測試自己:

/* testSqunit.nut
Description: Tests for the squnit testing framework
Author: Ricky Lung (http://mtlung.blogspot.com/)
Version: 1.0

License: X11 License

This set of files is published under the X11 License. You can do
more or less anything you want to do with it.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE X
CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

// This is a bit tricky since the test uses the features that it tests.

dofile("squnit.nut", true);

class TestSqUnit
{
function mypcall(func, ...)
{
local args = array(0);
for(local i=0; i<vargc; ++i)
args.push(vargv[i]);

try {
func.pacall(args);
return true;
} catch(e) {
return false;
}
}

function testAssertError()
{
local has_error = !mypcall(error, null, "coucou");
assert(has_error == true);
assertError(error, null, "coucou");
has_error = !mypcall(assertError, null, error, null, "coucou");
assert(has_error == false);

local f = function() {}
has_error = !mypcall(f, null);
assert(has_error == false);
has_error = !mypcall(assertError, null, f, null);
assert(has_error == true);

// multiple arguments
local multif = function(a, b, c) {
if(a == b && b == c) return;
error("three arguments not equal");
}

assertError(multif, null, 1, 1, 3);
assertError(multif, null, 1, 3, 1);
assertError(multif, null, 3, 1, 1);

has_error = !mypcall(assertError, null, multif, null, 1, 1, 1);
assert(has_error == true);
}

function testAssertEquals()
{
assertEquals(1, 1);
local has_error = !mypcall(assertEquals, 1, 2);
assert(has_error == true);
}

function testAssertClose()
{
assertClose(1, 1);
assertClose(-1, -1);
assertClose(0.1, -0.1, 0.2);
local has_error = !mypcall(assertEquals, 0.1, -0.1, 0.19);
assert(has_error == true);
}

function XtestXpcall()
{
local f = function() {
error("[this is a normal error]");
}
local g = function(f) {
f();
}
g(f);
}
} // TestSqUnit

//SqUnit().run("TestSqUnit.testAssertEquals"); // Will execute only one test
//SqUnit().run("TestSqUnit"); // Will execute only one class of test
SqUnit().run(); // Will execute all tests

2009年5月23日星期六

串列矩陣乘法優化

埋頭苦幹於 Squirrel 與其捆綁庫已有一個月多,偶爾間找一些好玩的來調劑一下,那就是有關 SIMD 的優化了。先呈上源碼:

// simdMatrixMul.h
#include <xmmintrin.h>

#if USE_ALIGNED
# define _MM_LOAD _mm_load_ps
# define _MM_STORE _mm_store_ps
# define ASSERT_ALIGNED(p) assert(intptr_t(p) % 16 == 0)
#else
# define _MM_LOAD _mm_loadu_ps
# define _MM_STORE _mm_storeu_ps
# define ASSERT_ALIGNED(p)
#endif

/*! SIMD chained matrix multiplication
Aliasing is allowed for the parameters
\param result Pointer to 16 floats as the return value
\param mats An array of pointer that point to a set of 16 floats
\param count Number of pointer in mats

Example:
\code
Matrix44 m1, m2, m3;
Matrix44 result;

// Perform result = m1 * m2 * m3;
float* mats[] = { (flaot*)(&m1), (float*)(&m2), (float*)(&m3) };
simdChainedMatrixMul((float*)&result, mats, 3);
\endcode
*/
void simdChainedMatrixMul(float* result, const float* mats[], size_t count)
{
assert(count >= 2);
ASSERT_ALIGNED(result);

__m128 ret[4];
__m128 x0, x1, x2, x3;

// Load the first matrix into ret
ret[0] = _MM_LOAD(mats[0] + 0*4);
ret[1] = _MM_LOAD(mats[0] + 1*4);
ret[2] = _MM_LOAD(mats[0] + 2*4);
ret[3] = _MM_LOAD(mats[0] + 3*4);

for(size_t j=1; j<count; ++j) {
ASSERT_ALIGNED(mats[j]);

// Prefetch the next matrix, may not usefull at all, test it case by case.
_mm_prefetch(reinterpret_cast<const char*>(&mats[j+1]), _MM_HINT_NTA);

__m128 x4 = _MM_LOAD(mats[j] + 0*4);
__m128 x5 = _MM_LOAD(mats[j] + 1*4);
__m128 x6 = _MM_LOAD(mats[j] + 2*4);
__m128 x7 = _MM_LOAD(mats[j] + 3*4);

for(size_t i=0; i<4; ++i) {
x1 = x2 = x3 = x0 = ret[i];
x0 = _mm_shuffle_ps(x0, x0, _MM_SHUFFLE(0,0,0,0));
x1 = _mm_shuffle_ps(x1, x1, _MM_SHUFFLE(1,1,1,1));
x2 = _mm_shuffle_ps(x2, x2, _MM_SHUFFLE(2,2,2,2));
x3 = _mm_shuffle_ps(x3, x3, _MM_SHUFFLE(3,3,3,3));

x0 = _mm_mul_ps(x0, x4);
x1 = _mm_mul_ps(x1, x5);
x2 = _mm_mul_ps(x2, x6);
x3 = _mm_mul_ps(x3, x7);

x2 = _mm_add_ps(x2, x0);
x3 = _mm_add_ps(x3, x1);
x3 = _mm_add_ps(x3, x2);

ret[i] = x3;
}
}

_MM_STORE(result + 0*4, ret[0]);
_MM_STORE(result + 1*4, ret[1]);
_MM_STORE(result + 2*4, ret[2]);
_MM_STORE(result + 3*4, ret[3]);
}
這一函數專門計算串列矩陣乘法,非常適合用於 transform traversal 或 skeleton animation 等等。因為所有相乘後的臨時結果都存儲在 SSE 寄存器內,沒有多餘的記憶體移動指令被浪費丟。兩項優化的結果使得 simdChainedMatrixMul 比一般矩陣乘法快三倍,以下是一個 Entity traversal 的測試程式:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Benchmarking options
#define USE_ALIGNED 1
#define USE_SIMD 1

#include "simdChainedMatrixMul.h"

_MM_ALIGN16 class Matrix44
{
public:
void randomize()
{
for(int i=0; i<4; ++i) for(int j=0; j<4; ++j)
data[i][j] = 0.96f * ((float(rand()) / RAND_MAX) * 2 - 1);
}

float* operator[](size_t i) { return data[i]; }
const float* operator[](size_t i) const { return data[i]; }

// No aliasing is allowed for the 2 parameters
void mul(Matrix44* __restrict result, const Matrix44* __restrict rhs)
{
assert(result != rhs);
assert(result != this);

/* for(size_t i=0; i<4; ++i) for(size_t j=0; j<4; ++j) {
float sum = 0;
for(size_t k=0; k<4; ++k)
sum += data[i][k] * rhs->data[k][j];
result->data[i][j] = sum;
}
return;*/

result->m00 = m00 * rhs->m00 + m01 * rhs->m10 + m02 * rhs->m20 + m03 * rhs->m30;
result->m01 = m00 * rhs->m01 + m01 * rhs->m11 + m02 * rhs->m21 + m03 * rhs->m31;
result->m02 = m00 * rhs->m02 + m01 * rhs->m12 + m02 * rhs->m22 + m03 * rhs->m32;
result->m03 = m00 * rhs->m03 + m01 * rhs->m13 + m02 * rhs->m23 + m03 * rhs->m33;

result->m10 = m10 * rhs->m00 + m11 * rhs->m10 + m12 * rhs->m20 + m13 * rhs->m30;
result->m11 = m10 * rhs->m01 + m11 * rhs->m11 + m12 * rhs->m21 + m13 * rhs->m31;
result->m12 = m10 * rhs->m02 + m11 * rhs->m12 + m12 * rhs->m22 + m13 * rhs->m32;
result->m13 = m10 * rhs->m03 + m11 * rhs->m13 + m12 * rhs->m23 + m13 * rhs->m33;

result->m20 = m20 * rhs->m00 + m21 * rhs->m10 + m22 * rhs->m20 + m23 * rhs->m30;
result->m21 = m20 * rhs->m01 + m21 * rhs->m11 + m22 * rhs->m21 + m23 * rhs->m31;
result->m22 = m20 * rhs->m02 + m21 * rhs->m12 + m22 * rhs->m22 + m23 * rhs->m32;
result->m23 = m20 * rhs->m03 + m21 * rhs->m13 + m22 * rhs->m23 + m23 * rhs->m33;

result->m30 = m30 * rhs->m00 + m31 * rhs->m10 + m32 * rhs->m20 + m33 * rhs->m30;
result->m31 = m30 * rhs->m01 + m31 * rhs->m11 + m32 * rhs->m21 + m33 * rhs->m31;
result->m32 = m30 * rhs->m02 + m31 * rhs->m12 + m32 * rhs->m22 + m33 * rhs->m32;
result->m33 = m30 * rhs->m03 + m31 * rhs->m13 + m32 * rhs->m23 + m33 * rhs->m33;
}

union {
// Individual elements
struct { float
m00, m01, m02, m03,
m10, m11, m12, m13,
m20, m21, m22, m23,
m30, m31, m32, m33;
};
// As a 2 dimension array
float data[4][4];
};
}; // Matrix44

//! A simple Entity with single list structure.
class Entity
{
public:
#if USE_ALIGNED
void* operator new(size_t size) { return _aligned_malloc(size, 16); }
void operator delete(void* p) { _aligned_free(p); }
#endif

Entity() : parent(NULL), child(NULL)
{
ASSERT_ALIGNED(&matrix);
matrix.randomize();
}
~Entity() { delete child; }

Entity* addChild(Entity* e)
{
assert(e->child == NULL);
assert(!child);
if(!child) {
this->child = e;
e->parent = this;
}
else {
e->child = child;
child->parent = e;
this->child = e;
e->parent = this;
}
return e;
};

Matrix44 calculateWorldMatrix1()
{
Matrix44 result = matrix;
Entity* e = this;
while((e = e->parent) != NULL) {
Matrix44 tmp(result);
tmp.mul(&result, &e->matrix);
}
return result;
}

Matrix44 calculateWorldMatrix2()
{
const float* matrixArray[1024] = {0};
size_t count = 0;
Entity* e = this;
do {
matrixArray[count++] = (float*)(&(e->matrix));
if(count >= 1024)
exit(-1);
} while((e = e->parent) != NULL);

Matrix44 result;
simdChainedMatrixMul((float*)(&result), matrixArray, count);

return result;
}

Matrix44 matrix;
Entity* parent, *child;
// char padding[1024*1024]; // To test the effect of cache miss
};

int main(int argc, char* argv[])
{
srand(123);
Entity* root = new Entity();

Entity* lastChild = root;
for(size_t i=1000; i--;)
lastChild = lastChild->addChild(new Entity());

for(size_t i=0; i<10; ++i) {
// Print out a summation variable to prevent compiler optimization
// that turns the whole benchmark into nothing
float sum = 0;

clock_t t1 = clock();

if(!USE_SIMD) for(size_t i=10000; i--;)
sum += lastChild->calculateWorldMatrix1().m00;
else for(size_t i=10000; i--;)
sum += lastChild->calculateWorldMatrix2().m00;

clock_t t2 = clock();

printf("%f, dummy:%i\n", float(t2 - t1) / CLOCKS_PER_SEC, sum);
}

delete root;
return 0;
}

2009年4月26日星期日

明日之星 - 小松鼠 Squirrel

(圖片來源:treehugger.com)

開始了把腳本語言加入遊戲引擎的工作,心目中的候選語言有 LuaPythonRubyJavascript V8 等等;最後選擇了這一隻小可愛 - Squirrel。小松鼠 Squirrel 由 Alberto Demichelis 所創立,他曾經是知名遊戲引擎公司 Crytek 的 System Architect,Far Cry/CRYEngine 裡的 Lua binding 庫都由他負責。就憑著他對 Lua 的透徹認識和深厚的編程功力,他決定打造一套比 Lua 更優秀的語言。

一向有使用 Lua 的我非常欣賞它的堆疊設計與小巧的機器碼,可惜它的語法對我來說不太自然,甚至未有對類別和標準字元編碼作直接支援。Squirrel 正好解決了這些已困擾我多時的問題,除此之外還加入了類似 C# 的 delegate 和 attribute。以下是 Squirrel 擁有的特色:
  • 動態型別檢查 - Dynamic typing
  • 委托模式 - Delegation
  • 類別與繼承 - Classes & inheritance
  • 類別属性 - Attribute
  • 運算元重載 - Operator overloading
  • 高階函數 - Higher order functions
  • 迭代發生器 - Generators
  • 協程 - Cooperative threads (coroutines)
  • 最佳化遞歸 - Tail recursion
  • 異常 - Exception handling
  • 實時自動內儲管理 - Automatic memory management (CPU bursts free; mixed approach ref counting/GC)
  • 弱引用 - Weak references
  • C/C++ 風格的語法
  • 可支援 16 bit 字串
  • 支援運行期即時編譯 - Just in time compilation
  • 支援 32 位和 64 位架構
  • 整個運行庫不用七千行源碼,200k byte 的機械
  • 遠程調試 - Remote debugging though TCP/IP
  • 與 Visual Studio 的整合

在技術層面來看,Squirrel 絕對適合用作遊戲腳本語言,現在還欠缺的是一群使用者。小弟也希望它能夠茁壯成長,正在為它開發一套模板式的 C++ 結合庫,下次再和大家分享。

2009年3月31日星期二

簡單地閱讀整個文件

有時你會想閱讀整個文件,而不是一行行或一個固定大小的緩衝區。這裡有一種方法來完成該工作:

std::ifstream file("myFile", std::ios_base::binary);
if(file) {
std::ostringstream buffer;
buffer << file.rdbuf();
file.close();

std::string data = buffer.str();
}

就這麼簡單。該文件的內容會複製到 ostringstream 去。
相同的代碼可以用來複製文件,您只需要把 ostringstream 更換成一個 ofstream。
請注意,以上代碼旨在簡潔,未必是高效的實作。

2009年3月19日星期四

Visual Studio 插件推介 - RockScroll

這個插件把原本平平無奇的卷軸棒躍身一變成為整個文件的預覽。

還有,當你雙擊選擇一個字,所有出現過的地方將自動突顯出來。



現在就下載 RockScroll 並安裝它!

2009年3月6日星期五

Accumulative Screen Space Ambient Occlusion

This is my attempt to combine Real-Time Reprojection Cache and Screen Space Ambient Occlusion. Using such caching scheme, the spatio-temporal coherence nature of the SSAO algorithm can be exploited. You can download the demo with shader source here.


Add Video
The name "Accumulative SSAO" comes from the fact that the occlusion value is accumulated and averaged over a number of frames. The algorithm itself is quite independent of how the occlusion is calculated and here I will assume the reader is familiar with SSAO implementation such as those from Crysis and Startcraft II.

The pipeline

For every frame,

  1. The scene was rendered using deferred shading technique, producing the color, normal and depth buffers.
  2. A number of random vectors were generated in CPU (where in usual SSAO these vectors only generated once in the program).
  3. The normal and depth buffer are then utilized to calculate the occlusion value in the SSAO pass.
  4. Instead of writing the occlusion value to the final output, it would combine with the previous frame's accumulated occlusion value and then written to a second accumulation buffer.
  5. A blur pass can optionally apply to the most updated accumulated occlusion buffer.
  6. The color buffer was then combined with the occlusion value to product the final result, also the two accumulation buffers were switched with each other.
Re-projection

The re-projection happens in the SSAO pass when it tries to access the previous frame's occlusion value. Having the eye-space 3d position for each pixel, we can transform that into a texture coordinate by using a matrix (and a perspective division afterward), lets call it the delta matrix. This matrix is calculated on CPU as:
bias = translation(0.5, 0.5, 0.5) * scale(0.5, 0.5, 0.5)
deltaMatrix = bias * lastFrameProjection * lastFrameView * currentFrameView.inverse()

In simple words, for each current frame's pixel, we are trying to locate their corresponding pixel coordinate on the last frame. If there is no camera movements, the two coordinates should be the same.

Accumulative AO

With the re-projection working, the current frame's occlusion value can be combined with the previous one with the following accumulation formula:
currentAo = currentAo / 30.0 + lastFrameAo * 29.0 / 30.0;

In order to make something interesting for the above equation, the current occlusion value should not be the same as the previous one. Therefore, a new set of sampling position should be generated for each frame, this can be done by re-generating the random unit sphere samples or the dithering texture every frame. In a loosely sense, it is doing a Monte Carlo Integration over the time domain. To achieve better visual quality, more frames should be taken over the time.

As each frame's AO value will also depends on the last few frames, there will be some time delay for the AO to become up-to-date in a dynamic scene. However, by changing the numerator and denominator in the equation, the trade-off between quality and responsiveness can be adjusted.

Cache-miss consideration

Up to now the cache miss problem of the re-projection is not yet addressed. A cache miss will happen if somewhere in the scene that cannot be seen before becoming visible now, due to camera or object movement. Such a cache miss can be detected by comparing the current pixel's depth value with it's re-projected counterpart. If the two values differed by a certain threshold, a cache miss is detected. And to do this, the last frame's depth value is needed. Instead of using a separated texture to store the last frame's depth value, the depth can be encoded and stored together with the accumulative AO value into a 32-bit texture.
// Encode a float value into 3 bytes
// The input value should be in the range of [0, 1
// Reference: http://www.ozone3d.net/blogs/lab/?p=113
vec3 packFloatToVec3i(const float value)
{
 const vec3 bitSh = vec3(256.0 * 256.0, 256.0, 1.0);
 const vec3 bitMsk = vec3(0.0, 1.0/256.0, 1.0/256.0);
 vec3 res = fract(value * bitSh);
 res -= res.xxy * bitMsk;
 return res;
}
float unpackFloatFromVec3i(const vec3 value)
{
 const vec3 bitSh = vec3(1.0/(256.0*256.0), 1.0/256.0, 1.0);
 return dot(value, bitSh);
}

If there was a cached miss, the accumulative AO will be discarded and the instance AO value is used instead. Of course more samples can be taken in this frame to reduce the visual impact of the cache miss.

Discussion/improvements
  • Currently a new independent set of random samples were generated for the above video demo. Other random sample over time generation method may reduce the noise.
  • As some of the re-projection cache scheme suggested, a cache value should be cleared after a certain period of time to avoid in-stability and provide a better response to dynamic environment, and this is done here by the accumulation formula.
  • To reduce cache miss due to object movement, each object's last transformation matrix can also be incorporated into the algorithm.
  • The depth encoding scheme also make the blur pass much more efficient.
Conclusion

The explained algorithm provides a new way to improve the quality and efficiency of traditional SSAO by using the result from a number of frames instead of one. It also opens up more parameters and sampling patterns to explore with.

2009年2月17日星期二

用 C# 製作使用者界面

過去兩個星期開始製作使用者界面,選擇了 C# 加 Microsoft Windows Form 作為平台。
除了 C# 本身容易使用外,最令我覺得欣慰的就是那龐大的使用者社區,好一些想要的功能都已經有其他人做好了,或已有詳盡的教學可供參考。
以下是一些在我製作 Studio Tool 的時候所用到的一些 library:

2009年2月16日星期一

移除 ".svn" 文件夾

話說有一天我想著怎樣把一個專案之下的 ".svn" 文件夾全部移除,想著想著的都是一些 Linux command。最後找到這個:

find -type d -name .svn -exec rm -rf {} \;

怎知原來每天使用的 Tortoise SVN 已經有這個功能... 又做了一件愚蠢的事 XD

2009年2月4日星期三

九型人格分析

看來我滿喜歡建造遊戲引擎來成全自己亦希望同時成就他人。

九型人格分析
第二型助人者、全愛型、助人型、成就他人者、博愛型
14%
第一型完美主義者、完美型、改革者、改進型、秩序大使
13%
第六型忠誠型、忠誠型、尋找安全者、謹慎型
13%
第三型成就者、事業型、成就型、實踐型
12%
第四型藝術型、浪漫者、自我型、憑感覺者
11%
第五型智慧型、觀察者、思想型、理性分析者、思考型
9%
第七型快樂主義型、豐富型、活躍型、創造可能者、享樂型
9%
第八型領袖型、能力型、挑戰者、保護者、權威型
9%
第九型和平型、和平者、和諧型、維持和諧者
7%

2009年1月15日星期四

SSAO Demo

花了一點時間整理好我的 SSAO Demo,請按這裡下載。請各位多給意見 :)

操縱方法:
  • 鏡頭移動:W、A、S、D、Page up、Page down
  • 鏡頭方向:滑鼠左鍵
  • SSAO開關:F1,默認值:開
  • 銀幕減半開關:F2,默認值:開
  • 增大/減少模糊操作次數:F3/Shift+F3,默認值:2
  • 漫射材質開關:F4,默認值:關
  • 增大/減少閉塞半徑:Shift+F5/F5
後記:
  • 看過 R5 Demo 後才覺悟增大 blur pass 次數和 blur kernel size 的分別。 Orz
  • depth encode 到三個 8-bit integer,那麼一塊 32-bit 的 RGBA render target 就可以運載 occlusion value 和 depth 到模糊操作中,從而大大減少 texture fetch 的數量。
相關文章

2009年1月13日星期二

檔案監視器

看過了猴子靈藥的 "Database Hot Loader" 後,心動之下又想做些類似的東西,哪就是一個用來監視檔案系統的小工具。有了它,遊戲裡的任何素材(美術素材,以及音效、字型、腳本程序等等)檔案一經修改就會立即在遊戲裡反映出來,因而省去重新啟動遊戲程式的煩厭。當然背後還需健全的資源系統才能成事。

在視窗環境中,標準的方法是調用 FindFirstChangeNotificationReadDirectoryChangesW;前者告訴你某個資料夾有否被更改,後者還會告訴你甚麼檔案/資料夾曾被更改。怎知 ReadDirectoryChangesW 的調用殊不簡單,MSDN 又沒有範例,上 CodeProject 碰碰運氣得來的是一個 3000 多行代碼的類別,Google 一翻還是找不到想要的。

經過一翻努力和嘗試(我相信 MSDN 是有錯漏的),得知 GetOverlappedResultReadDirectoryChangesW 的配合是最簡單的;無須和任何多緒有關的東西打交道。我的 FileMonitor 就只有 constructor 和 getChangedFile 這兩個函數。

FileMonitor.h

#ifndef __FILEMONITOR__
#define __FILEMONITOR__

#include <string>

/*! To monitor file changes under a particular folder.
The implementation use the win32 ReadDirectoryChangesW() function
with GetOverlappedResult() to perform the monitoring, therefore
no thread is created and so making the interface very simple.

With the limitation of ReadDirectoryChangesW() is using a fixed buffer
to hold the information between calls of getChangedFile(), FileMonitor
may fail to detect file changes between calls of getChangedFile() if
the file names are too large to fit into the buffer. To overcome the
issue, you need to call getChangedFile() frequently.

\sa http://mtlung.blogspot.com/2009/01/blog-post.html

Example:
\code
FileMonitor monitor(L"pathToMonitor", true);
// In your main loop:
while(true) {
std::wstring path;
while(!(path = monitor.getChangedFile()).empty()) {
std::wcout << path << std::endl;
}
}
\endcode
*/
class FileMonitor
{
// FileMonitor is non-copyable
FileMonitor(const FileMonitor&);
FileMonitor& operator=(const FileMonitor&);

public:
/*! Constructor
\param path The path to monitor
\param recursive Watch the path recursively
\param operationTowatch Which file operation to monitor with,
having the same meaning as the dwNotifyFilter in ReadDirectoryChangesW()
function, see http://msdn.microsoft.com/en-us/library/aa365465(VS.85).aspx
for more details. By default, it's value is -1 and have the same meaning
as FILE_NOTIFY_CHANGE_LAST_WRITE
*/
FileMonitor(const wchar_t* path, bool recursive, int operationTowatch = -1);

~FileMonitor();

/*! Get which file under the watching directory is changed.
This function is non-blocking and if there is no changes in the
file system, it will simple return an empty string.

\note
The current implementation use a fixed buffer to capture all the
file changes between calls of getChangedFile(). If there are too
much changes or the file names get too long, the buffer overflow
and that file change notification will lost.
See more on the documentation of ReadDirectoryChangesW() in MSDN.
*/
std::wstring getChangedFile() const;

private:
class Impl; //!< Private implementation class
Impl* mImpl;
}; // FileMonitor

#endif // __FILEMONITOR__

FileMonitor.cpp

#include "FileMonitor.h"
#include <assert.h>
#include <iostream>
#include <list>

// Exclude rarely-used stuff from Windows headers
#ifndef WIN32_LEAN_AND_MEAN
# define WIN32_LEAN_AND_MEAN
#endif
#ifndef VC_EXTRALEAN
# define VC_EXTRALEAN
#endif
#include <windows.h>

#ifdef _MSC_VER // Currently only windows is supported

class FileMonitor::Impl
{
public:
Impl(const wchar_t* path, bool recursive, int operationTowatch)
: mRecursive(recursive), mOperationTowatch(operationTowatch)
{
assert(int(mBuffer) % 4 == 0 && "Address of mBuffer must be 4-byte aligned");

// Adjust the defalt value for mOperationTowatch
if(mOperationTowatch == -1)
mOperationTowatch = FILE_NOTIFY_CHANGE_LAST_WRITE;

mDirectory = ::CreateFileW(
path,
FILE_LIST_DIRECTORY,
FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,
0,
OPEN_EXISTING,
// ReadDirectoryChangesW() needs FILE_FLAG_BACKUP_SEMANTICS
FILE_FLAG_OVERLAPPED | FILE_FLAG_BACKUP_SEMANTICS,
0
);

memset(&mOverlapped, 0, sizeof(mOverlapped));
if(!mDirectory || !readChange()) {
::CloseHandle(mDirectory);
mDirectory = NULL;
std::wcerr << L"Fail to watch directory: " << path << std::endl;
}
}

~Impl()
{
::CloseHandle(mDirectory);
}

bool readChange() const
{
return ::ReadDirectoryChangesW(
mDirectory,
mBuffer, sizeof(mBuffer),
mRecursive,
mOperationTowatch,
NULL, // bytesRetured
&mOverlapped,
0 // callBack
) != 0;
}

std::wstring getChangedFile() const
{
// We will try to call GetOverlappedResult() even there are entries inside
// mFiles, so that it's less possible for the mBuffer to be overflowed.

// For some unknown reason(s) ReadDirectoryChangesW() will report the file twice,
// therefore we add a loop to filter out those duplicated entries.
for(size_t i=2; i--;)
{
DWORD bytesRetured = 0;
if(0 == ::GetOverlappedResult(mDirectory, &mOverlapped, &bytesRetured, false))
goto CACHED; // The use of goto here makes the code clean.

if(bytesRetured == 0) {
// TODO: To reduce the chance of insufficient buffer,
// we can move the code to another thread.
std::wcerr << L"Error returned by ReadDirectoryChangesW(), "
L"most likely the internal buffer is too small" << std::endl;
readChange();
goto CACHED;
}

FILE_NOTIFY_INFORMATION* p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(mBuffer);
while(true)
{
std::wstring fileName(p->FileName, p->FileNameLength / sizeof(wchar_t));

// Skip duplicated entry
if(mFiles.empty() || fileName != mFiles.back())
mFiles.push_back(fileName);

if(p->NextEntryOffset == 0)
break;

p = reinterpret_cast<FILE_NOTIFY_INFORMATION*>((char*)p + p->NextEntryOffset);

// Do some extra buffer overflow check.
if((char*)p - (char*)mBuffer > sizeof(mBuffer))
break;
}

if(!readChange())
return L"";
}

CACHED:
if(!mFiles.empty()) {
std::wstring ret = mFiles.front();
mFiles.pop_front();
return ret;
}

return L"";
}

HANDLE mDirectory;
bool mRecursive;
int mOperationTowatch;
/*! This buffer must be 4-byte aligned, therefore we use int as the type.
You may change the buffer size to fit your needs.
*/
mutable int mBuffer[2048];
mutable OVERLAPPED mOverlapped;
//! A list of wstring acting as a circular buffer.
mutable std::list<std::wstring> mFiles;
}; // Impl

FileMonitor::FileMonitor(const wchar_t* path, bool recursive, int operationTowatch)
{
mImpl = new Impl(path, recursive, operationTowatch);
}

FileMonitor::~FileMonitor()
{
delete mImpl;
}

std::wstring FileMonitor::getChangedFile() const
{
__assume(mImpl); // We know mImpl is always not null, shut off the C++ analysis warning
return mImpl->getChangedFile();
}

#endif // _MSC_VER

Main.cpp

#include "FileMonitor.h"
#include <iostream>
#include <conio.h> // For _kbhit()

int main()
{
FileMonitor monitor(L"./", true);

std::wcout << L"Create and modify the files in the current directory, "
L"and the FileMonitor will tell you the name of those files.";
std::wcout << L" Press any key to quit the program" << std::endl;

while(!_kbhit())
{
std::wstring path;
// Keep polling the monitor, But a real application should
// only poll the monitor once a while.
while(!(path = monitor.getChangedFile()).empty())
{
std::wcout << path << std::endl;
}
}

return 0;
}