Sample in C#:
For starters, add a reference to Microsoft.Office.Interop.Word. Then you can do some basic parsing:
    var wdApp = new Application();
    var dict = new Dictionary<string, string>();
    //paths is some collection of paths to the Word documents
    //You can use Directory.EnumerateFiles to get such a collection from a folder
    //EnumerateFiles also allows you to filter the files, say to only .doc
    foreach (var path in paths) {
        var wdDoc = wdApp.Documents.Open(path);
        foreach (Paragraph p in wdDoc.Paragraphs) {
            //Range.Text includes the trailing paragraph mark, so trim it off
            var text = p.Range.Text.TrimEnd('\r', '\n');
            var delimiterPos = text.IndexOf(";");
            //skip paragraphs that have no delimiter
            if (delimiterPos < 0) { continue; }
            //Add throws if the same key appears twice; use dict[key] = value to overwrite instead
            dict.Add(
                text.Substring(0, delimiterPos),
                text.Substring(delimiterPos + 1)
            );
        }
        wdDoc.Close();
    }
    //close Word, otherwise the process keeps running in the background
    wdApp.Quit();
    //This can be done more cleanly using LINQ, but Dictionary<TKey,TValue> doesn't have an AddRange method.
    //OTOH, such a method can be easily added as an extension method, taking IEnumerable<KeyValuePair<TKey,TValue>>
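The AddRange idea from the comments above could look roughly like this. It is only a sketch: DictionaryExtensions, AddRange, ParagraphParser and ToPairs are names made up for illustration, not part of the framework or of the Word interop library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryExtensions {
    // Hypothetical helper: adds every pair, throwing on duplicate keys just like Dictionary.Add
    public static void AddRange<TKey, TValue>(
        this IDictionary<TKey, TValue> target,
        IEnumerable<KeyValuePair<TKey, TValue>> pairs
    ) {
        if (target == null) { throw new ArgumentNullException("target"); }
        if (pairs == null) { throw new ArgumentNullException("pairs"); }
        foreach (var pair in pairs) {
            target.Add(pair.Key, pair.Value);
        }
    }
}

public static class ParagraphParser {
    // Hypothetical helper: turns "key;value" lines into pairs,
    // splitting on the first delimiter and skipping lines that have none
    public static IEnumerable<KeyValuePair<string, string>> ToPairs(
        IEnumerable<string> lines, char delimiter = ';'
    ) {
        return
            from line in lines
            let text = line.TrimEnd('\r', '\n')
            let pos = text.IndexOf(delimiter)
            where pos >= 0
            select new KeyValuePair<string, string>(
                text.Substring(0, pos),
                text.Substring(pos + 1)
            );
    }
}
```

With these in place, the inner loop should reduce to something like dict.AddRange(ParagraphParser.ToPairs(wdDoc.Paragraphs.Cast<Paragraph>().Select(p => p.Range.Text))).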
For more complex parsing, you can save each document as a new text file:
    var newPaths =
        from path in paths
        select new {
            path,
            //If needed, add some logic to put the textfile in a different folder
            newPath = Path.ChangeExtension(path, ".txt")
        };
    var wdApp = new Application();
    foreach (var item in newPaths) {
        var wdDoc = wdApp.Documents.Open(item.path);
        wdDoc.SaveAs2(
            FileName: item.newPath,
            FileFormat: WdSaveFormat.wdFormatText
        );
        wdDoc.Close();
    }
    //close Word, otherwise the process keeps running in the background
    wdApp.Quit();
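If the text files should go to a different folder, the newPath logic could be something along these lines. TextPathHelper, ToTextPath and outputFolder are hypothetical names used only for this sketch.

```csharp
using System.IO;

public static class TextPathHelper {
    // Hypothetical helper: maps a Word document path to a .txt path inside outputFolder,
    // keeping the original file name and swapping only the extension
    public static string ToTextPath(string docPath, string outputFolder) {
        var fileName = Path.GetFileNameWithoutExtension(docPath) + ".txt";
        return Path.Combine(outputFolder, fileName);
    }
}
```

In the query above, newPath = TextPathHelper.ToTextPath(path, outputFolder) would then replace the Path.ChangeExtension call.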
You may also need to create a file named schema.ini and put it in the same folder as the text files (Microsoft's Schema.ini documentation for the text file driver describes the syntax):
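    //assuming the delimiter is a ;
    //schemaPath is the full path of schema.ini, in the same folder as the text files
    File.WriteAllLines(schemaPath,
        from item in newPaths
        select String.Format(
            "[{0}]\r\nFormat=Delimited(;)",
            //the section name is the file name only, without the folder
            Path.GetFileName(item.newPath)
        )
    );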
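A slightly fuller schema.ini section might also say that there is no header row and name the columns. The sketch below builds one such section as a string; SchemaIni and BuildEntry are hypothetical names, and the two column names (EntryKey, EntryValue) are assumptions about what the files contain.

```csharp
using System;
using System.IO;

public static class SchemaIni {
    // Builds one schema.ini section for a delimited text file without a header row.
    // The two-column layout and the column names are assumptions for illustration.
    public static string BuildEntry(string textFilePath, char delimiter = ';') {
        var lines = new[] {
            "[" + Path.GetFileName(textFilePath) + "]",
            "Format=Delimited(" + delimiter + ")",
            "ColNameHeader=False",
            "Col1=EntryKey Text",
            "Col2=EntryValue Text"
        };
        return string.Join(Environment.NewLine, lines);
    }
}
```

Named columns let the query select EntryKey and EntryValue directly instead of relying on the default names the driver assigns.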
Then, you can query the resulting text files using SQL statements, via the OleDbConnection, OleDbCommand, and OleDbDataReader classes.
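    foreach (var item in newPaths) {
        //For the text driver, Data Source is the folder; the file name acts as the table name
        var connectionString =
            "Provider=Microsoft.Jet.OLEDB.4.0;" +
            "Extended Properties=\"text;HDR=NO;IMEX=1;\";" +
            "Data Source=" + Path.GetDirectoryName(item.newPath);
        using (var conn = new OleDbConnection(connectionString)) {
            //the connection must be open before the command is executed
            conn.Open();
            using (var cmd = conn.CreateCommand()) {
                cmd.CommandText = String.Format(
                    "SELECT * FROM [{0}]",
                    Path.GetFileName(item.newPath)
                );
                using (var rdr = cmd.ExecuteReader()) {
                    //parse file contents here
                }
            }
        }
    }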
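As for the "parse file contents here" part, one possible shape is a helper that walks the reader and collects the first two columns. This is a sketch under assumptions: ReaderParsing and ReadPairs are made-up names, and it presumes every row has at least two columns, key first. It takes IDataReader, which OleDbDataReader implements, so it can be tried without a real connection.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ReaderParsing {
    // Hypothetical helper: reads column 0 as the key and column 1 as the value.
    // Rows with a null key are skipped; a later row overwrites an earlier one with the same key.
    public static Dictionary<string, string> ReadPairs(IDataReader rdr) {
        var result = new Dictionary<string, string>();
        while (rdr.Read()) {
            if (rdr.IsDBNull(0)) { continue; }
            var key = Convert.ToString(rdr.GetValue(0));
            var value = rdr.IsDBNull(1) ? "" : Convert.ToString(rdr.GetValue(1));
            result[key] = value;
        }
        return result;
    }
}
```

Inside the innermost using block, var pairs = ReaderParsing.ReadPairs(rdr); would then give the contents of one file.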