Hi,
I'm working on job site (using asp.net 3.5) in which a user can upload one or more resumes. Database is sql server 2005. Whats the recommendation regarding storage of these documents? Shall i store it as a blob in the database or just store the path in database and keep the document in the physical folder? Can someone please mention the pros and cons of each wrt performance and scalability?
Thanks!
-
Storing files in the database gives you:
(+) easier house-keeping, no orphaned files on the disk
(+) consistent programming and access model (authorization)
(-) Database grows exorbitantly, which will increase backup time and disk space use to hold backup files
I don't think that either option scales better than the other. If you expect very large numbers of documents, you will hit limits regarding the number of files in a directory (either as an absolute limit, or because directory lookups begin to perform very poorly at a certain point).
Jeremy Thomson : thanks for your answer...! -
Keeping your data in the database will make managing the date much easier. You won't have to make sure the data in the database is in sync with the data (file paths) on the filesystem.
Having said that, it is important to realize there might be a performance hit when storing large blobs in the database. Fortunately, there has been some research done in this area to investigate possible scenarios. I would recommend you take a look at this paper by Microsoft Research on the subject. Hope that helps.
Furthermore, I should mention that Microsoft SharePoint stores its Word documents (as well as other files) in SQL Server databases as blobs. You can read more about this here.
-
There's nothing to say you can't do both - parse the doc & store important meta info (such as contact info, frequently used words, etc) into indexed columns, store a path to the document. That is what I did for a high availability (upto 21M impressions/month) db, but we were serving media files, but the file size & principal is about the same.
-
You could try storing the document files in a separate database and use the Simple Recovery Model. Then you can easily back up the documents database without worrying about your backups ballooning in size.
-
I've had two systems that stored files in folders - both scaled poorly when we got to about 20K files in a folder. Can be solved by using subfolders (2008-01, 2008-02 for example) but as said by others, you have synching issues.
Database allows you to archive/delete old resumes and no synching problems. You can compress/uncompress the text when storing/retrieving as long as you don't want to full text search the data. Assuming a 10-20K file, you could compress to 1-2K probably. Not that big.
-
I'm in charge of some applications that manage many TB of images. I think this application is similar in that it's storing files relatively large to typical database rows. We've found that storing file paths in the database to be best.
There are a couple of issues:
- database storage is usually more expensive than file system storage.
- you can accelerate/distribute file system access with standard off the shelf products.
- things like web servers, etc, need no special coding or processing to access images in the file system.
-
I recommend storing it in the database to keep things together.
One additional point to consider when storing in the the database is that a SQL database can be comprised of multiple files. What I would advise is to create A second file (.ndf) on this database that houses the tables that contain BLOB data.
This provides some key benefits in that it can (1) be run on a separate disk array if increased demand ever calls for it. (2) can be backed up and restored separately. (3) will not bloat your main OLTP database file.
-
I'd go with using the best tool for each job:
- the best database type for storing and querying data is a relational database.
- the best database for managing files is the file system.
Using the filesystem will allow you to use the right tools for managing files: you can use the OS's compression features, use incremental backups tools, etc and not bloat your database with sometimes poorly constructed Word Documents that may contain large images.
As the number of files gets large, the problem is easy to solve from the point of view of the filesystem: store the files into relative path directories so you can keep the number of file per directory under control (for instance, creating a weekly folder or creating a new folder when the file count is over a certain limit).
Then you only need to keep the relative path of the file in the database.I would also normalise the filenames at the time of submission to keep them consistent, like
20090219-001.doc
.As other suggested, you could parse the doc file to extract the pure textual information and store it in the database to get the benefits of text search.
0 comments:
Post a Comment