Archive: Idea for better compression


Until now, the datablock optimizer has reduced the size of the installer only when the same file is referenced at least twice in the script. But what about nearly identical files, or files of the same type (for example, source files) that have a lot of redundancy between them? Why not add some kind of solid archiving? It would require some sorting: by file extension first, then alphabetically by file name.

Here are the contents of Rarfiles.lst, the list that RAR/WinRAR uses to sort files for solid archiving. (The header comments, originally in French, are translated here.)

; List for ordering the files of a solid archive
;
; You can change the sort order that RAR uses when it adds
; files to a solid archive
;
; This file may contain file names, wildcards,
; or a special entry: $default. This entry defines the position of
; files that do not match the other entries in this file.
; Lines starting with the ";" symbol are treated as comments
; and are not processed.
;
; Place this file in the same directory as RAR.EXE.
;
; Tips for improving compression ratio and speed:
;
; - files containing similar information should be
;   grouped together in the archive if possible;
; - frequently used files should be placed at the beginning.
;
file_id.diz
descript.ion
read.*
readme.*
*.doc
*.txt
*.htm
*.html
*.shtml
*.lst
*.log
*.ini
*.bat
*.cmd
*.h
*.c
*.cpp
*.asm
*.bas
*.inf
*.bak
*.rtf
*.hlp
*.com
*.exe
*.dll
*.ovr
*.ovl
*.obj
*.lib
*.sys
*.drv
*.bin
*.bmp
*.wav
*.stm
$default
*.gif
*.jpg
*.tif
*.arj
*.ha
*.lzh
*.rar
*.zip

This kind of list could easily be put in nsisconf.nsi, with parsing similar to InstType. So, in an NSI script, we would have, for example:

FileType "descript.ion"
FileType "read.*"
FileType "readme.*"
FileType "*.doc"
FileType "*.txt"
FileType "*.htm"
FileType "*.html"
FileType "*.shtml"
FileType "*.lst"
FileType "*.log"
FileType "*.ini"
.
.
.
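
For what it's worth, the sort such a list implies can be sketched in a few lines. This is Python rather than anything from NSIS itself, and `PATTERNS`, `rank` and `sort_for_solid` are made-up names. The sketch assumes case-insensitive matching on the base name, that the first matching pattern wins, that unmatched files go to the `$default` position, and that paths use forward slashes.

```python
from fnmatch import fnmatchcase
import posixpath

# Abbreviated, Rarfiles.lst-style ordering; "$default" marks where unmatched files go.
PATTERNS = ["read.*", "readme.*", "*.txt", "*.h", "*.c", "*.exe",
            "$default", "*.jpg", "*.zip"]

def rank(path, patterns=PATTERNS):
    """Index of the first pattern matching the file's base name (case-insensitive)."""
    name = posixpath.basename(path).lower()
    for i, pat in enumerate(patterns):
        if pat != "$default" and fnmatchcase(name, pat.lower()):
            return i
    # Unmatched files take the "$default" slot, or go last if there is none.
    return patterns.index("$default") if "$default" in patterns else len(patterns)

def sort_for_solid(paths, patterns=PATTERNS):
    # Primary key: position in the pattern list; secondary key: name, alphabetically.
    return sorted(paths, key=lambda p: (rank(p, patterns),
                                        posixpath.basename(p).lower()))

files = ["src/main.c", "logo.jpg", "README.TXT", "src/util.h",
         "data.xyz", "src/util.c", "setup.exe"]
print(sort_for_solid(files))
```

Files of the same kind end up next to each other, which is what a solid compressor needs in order to find the redundancy between them.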


AFAIK every file in the installer is compressed separately.
If the files are archived 'solid', you'd have to decompress all of them when installing (or at least decompress and skip the ones you don't need).
If solid archiving is added, then bzip2 compression will improve for small files too.
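
A quick way to see the effect on small, similar files is to compare one stream per file against one stream over everything. The sketch below uses Python's `zlib` module with made-up file contents; it only illustrates the size difference and says nothing about how the NSIS datablock is actually laid out. The same comparison could be made with `bz2.compress`.

```python
import zlib

# Twenty small, made-up "source files" that share most of their text.
files = [
    (
        "/* module %d */\n"
        "#include <windows.h>\n"
        "#include \"config.h\"\n"
        "int handler_%d(HWND hwnd, UINT msg) {\n"
        "    return DefWindowProc(hwnd, msg, 0, %d);\n"
        "}\n" % (i, i, i)
    ).encode("ascii")
    for i in range(20)
]

# One zlib stream per file (every file compressed separately).
separate = sum(len(zlib.compress(f, 9)) for f in files)

# One zlib stream over all files back to back ("solid").
solid = len(zlib.compress(b"".join(files), 9))

print("separate:", separate, "bytes; solid:", solid, "bytes")
```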


What if the dictionary were stored at a separate offset and then, through some hacking of zlib (I don't know about bzip2; I'm embarrassed to say I haven't used it before), the locations of the compressed file data within the long block were recorded instead of the offset to the 'Section'? Then you could extract files ad hoc and independently by prepending the dictionary to the raw data, without any modification of the zlib inflate code.
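
Something close to this is possible with zlib's stock preset-dictionary feature (`deflateSetDictionary` / `inflateSetDictionary` in C), which may or may not be exactly what is meant above. Here is a minimal sketch using Python's `zlib` module, where the same feature is exposed as the `zdict` argument. The file contents and the helper names `pack` and `unpack` are made up for illustration.

```python
import zlib

# Shared "dictionary": bytes expected to recur across files (made-up content).
shared = (b'#include <windows.h>\n#include "config.h"\n'
          b'int handler(HWND hwnd, UINT msg) {\n'
          b'    return DefWindowProc(hwnd, msg, 0, 0);\n}\n')

files = {
    "a.c": b"/* a */\n" + shared,
    "b.c": b"/* b */\n" + shared.replace(b"handler", b"on_paint"),
}

def pack(data, zdict):
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def unpack(blob, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

# Each file is its own stream, so any one can be extracted without the others,
# as long as the shared dictionary is available.
blobs = {name: pack(data, shared) for name, data in files.items()}
assert unpack(blobs["b.c"], shared) == files["b.c"]

for name, data in files.items():
    print(name, "with dict:", len(blobs[name]),
          "without:", len(zlib.compress(data, 9)))
```

The trade-off is that the dictionary has to be stored once in the installer and chosen well; files that share nothing with it gain nothing.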


Forgot to add: perhaps, as Repzilon implied, ASCII and binary files could be separated so as to improve the Huffman weightings in the dictionary and raw data.


Here is an update to my specification.

First of all, to get faster decompression (because decompressing part of a solid archive is a pain), instead of 100% solid compression it would be "solid section" compression. Let me explain: each section behaves like a separate solid archive, and all of them are then put into the NSIS-generated installer.
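
To make the "solid section" idea concrete, here is a rough sketch in Python. The section names, file contents, and the `build` / `extract` helpers are all hypothetical, and the real datablock format would of course look different. The point is only that each section is one compressed stream, so extracting a file means decompressing its own section and nothing else.

```python
import zlib

# Made-up sections and file contents, for illustration only.
sections = {
    "Core": {"readme.txt": b"Read me first.\n" * 4,
             "license.txt": b"License text.\n" * 4},
    "Source": {"main.c": b"int main(void) { return 0; }\n",
               "util.c": b"int util(void) { return 1; }\n"},
}

def build(sections):
    """Compress each section as one stream; record (offset, size) of each file in it."""
    blobs, index = {}, {}
    for sec, files in sections.items():
        raw = b""
        for name, data in files.items():
            index[(sec, name)] = (len(raw), len(data))
            raw += data
        blobs[sec] = zlib.compress(raw, 9)
    return blobs, index

def extract(blobs, index, sec, name):
    # Only the stream of the requested section is decompressed.
    raw = zlib.decompress(blobs[sec])
    off, size = index[(sec, name)]
    return raw[off:off + size]

blobs, index = build(sections)
print(extract(blobs, index, "Source", "util.c"))
```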

My FileType proposal changes a little. It is now:
FileType ext_spec ( text | bin )
We can now use the Zlib strategy more efficiently.
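
A sketch of how such a line might be parsed and then used to pick a zlib strategy, in Python. Everything here is an assumption: `FileType` does not exist in NSIS, `parse_filetype` and `compress_as` are made-up helpers, and the mapping from `text` / `bin` to a particular strategy constant is only a placeholder that would have to be settled by measuring real data.

```python
import shlex
import zlib

# Placeholder mapping from the proposed kind to a zlib strategy; which strategy
# actually suits "text" or "bin" data is an open question to settle by measurement.
STRATEGY = {"text": zlib.Z_DEFAULT_STRATEGY, "bin": zlib.Z_FILTERED}

def parse_filetype(line):
    """Parse 'FileType ext_spec (text|bin)' into (pattern, kind)."""
    parts = shlex.split(line)
    if len(parts) != 3 or parts[0].lower() != "filetype":
        raise ValueError("expected: FileType ext_spec (text|bin)")
    pattern, kind = parts[1], parts[2].lower()
    if kind not in STRATEGY:
        raise ValueError("unknown kind: " + parts[2])
    return pattern, kind

def compress_as(data, kind):
    # The strategy is the fifth argument of compressobj (deflateInit2 in C).
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                         zlib.DEF_MEM_LEVEL, STRATEGY[kind])
    return c.compress(data) + c.flush()

pattern, kind = parse_filetype('FileType "*.txt" text')
blob = compress_as(b"hello hello hello\n", kind)
print(pattern, kind, zlib.decompress(blob))
```

Whatever strategy is used, the output is still an ordinary zlib stream, so the decompression side would not need to know which one was chosen.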

Then there is parsing code to change. First we would have to sort the instructions within each section, File instructions first. After that, we need to sort the File instructions using the sort key defined by the FileType instructions; where none is defined, we sort by file type, then by file name.
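
Taken literally, that reordering could look like the Python sketch below. The section contents and the names `fallback_key` and `reorder` are made up, and only the fallback order (extension, then file name) is implemented; a key built from FileType entries could be passed in as `file_key` instead.

```python
import posixpath

# A section as a list of (instruction, argument) pairs; contents are made up.
section = [
    ("DetailPrint", "Installing..."),
    ("File", "src/util.c"),
    ("WriteUninstaller", "uninst.exe"),
    ("File", "readme.txt"),
    ("File", "src/main.c"),
    ("File", "src/util.h"),
]

def fallback_key(path):
    """Sort by extension first, then by file name."""
    name = posixpath.basename(path).lower()
    return (posixpath.splitext(name)[1], name)

def reorder(section, file_key=fallback_key):
    files = [ins for ins in section if ins[0] == "File"]
    others = [ins for ins in section if ins[0] != "File"]  # original order kept
    files.sort(key=lambda ins: file_key(ins[1]))
    return files + others

for ins in reorder(section):
    print(*ins)
```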

Finally, edit the datablock optimizer code to do solid archiving.