Web scraping - C# web scraper copying text


I have a web scraper written in C# that is extracting data. I want to copy text from a web browser control and paste it into a Word file programmatically. When I try to extract the rich text box content using its ID and InnerText, the text contains encoded characters such as %2c.

I need the text with its formatting but can't find a way to get it. I have tried Encoding, HttpUtility.UrlDecode, SendKeys, and elem.InvokeMember() without success.

How can I programmatically copy and paste text from a web browser control while preserving formatting?

Here is sample data that I want to extract:

Description

The Advance Concepts Engineering team designs and develops new vehicles to meet future regulatory requirements and customer competitive requirements. The qualified candidate will be responsible for total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. Lead cross functional team meetings working with systems & components, advance manufacturing, service, etc. to ensure solutions are optimized for all stages of the vehicle's life.

    HtmlElement elem = wb.Document.GetElementById("ctl00_contplhdynamic_txtdescrcontenthiddentextarea");
    if (elem == null) return;
    elem.InvokeMember("click");
    //elem.InvokeMember("select all");
    //elem.InvokeMember("copy");
    SendKeys.SendWait("^a");
    SendKeys.SendWait("^c");

    Clipboard.Clear();
    elem.Focus();
    elem.InvokeMember("right click");
    elem.InvokeMember("select all");
    elem.InvokeMember("copy");

    Clipboard.SetText(elem.InnerText);
    string clipbrdText = Clipboard.GetText();

    string data = elem.InnerText;

    richTextBox1.Text = data;
    string temp = System.Web.HttpUtility.UrlDecode(data);

    Encoding iso = Encoding.GetEncoding("windows-1252");
    Encoding utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(data);
    byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
    string msg = iso.GetString(isoBytes);

Text such as "%2c" has been encoded. If you are getting the content of a web page, you should be decoding HTML, not a URL. You can use HttpUtility.HtmlDecode or, if you are using .NET 4.0 or above, WebUtility.HtmlDecode, which is available within the System.Net namespace.
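To check which decoder a given string needs, the two can be compared side by side. The following is only a minimal sketch, and the sample strings are hypothetical rather than taken from the scraped page. Sequences of the form %2c are percent-encoding, which the URL decoder handles, while entities such as &amp;amp; are what the HTML decoder handles, so which call applies depends on how the page actually encoded the field. Note that WebUtility.UrlDecode requires .NET 4.5 or above.

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    public static void Main()
    {
        // Hypothetical sample strings, used only for illustration.
        string percentEncoded = "designs%2c develops";
        string htmlEncoded = "systems &amp; components";

        // Percent-encoded sequences (%2c is a comma) need the URL decoder.
        Console.WriteLine(WebUtility.UrlDecode(percentEncoded));  // designs, develops

        // HTML entities (&amp; is an ampersand) need the HTML decoder.
        Console.WriteLine(WebUtility.HtmlDecode(htmlEncoded));    // systems & components
    }
}
```

Remember to use the decoded result afterwards: in the question's code the output of UrlDecode is stored in temp but never read.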

You should note that Word does not use HTML formatting, so you won't be able to paste HTML tags and expect it to recognise them, i.e. <strong>Description</strong> will not result in bold text if you type it into Word.

Edit:

It looks like you are mixing two different ways to copy text in the code you pasted: both SendKeys.SendWait("^c"); and elem.InvokeMember("copy");. I presume both of these methods work?

I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try requesting formatted text with Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should return the string with its formatting preserved.
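One way to apply this is to ask the clipboard which formats it holds and read the richest one. The sketch below assumes a Windows Forms application in which the copy step has already placed content on the clipboard. The class name ClipboardReader and its method names are hypothetical helpers, not part of the framework. Whether the web browser control actually puts RTF on the clipboard is not guaranteed, which is why the code falls back to HTML and then to plain text.

```csharp
using System;
using System.Windows.Forms;

public static class ClipboardReader
{
    // Picks the richest text format reported as available:
    // RTF first, then HTML, then plain Unicode text.
    public static TextDataFormat PickRichestFormat(Func<TextDataFormat, bool> isAvailable)
    {
        if (isAvailable(TextDataFormat.Rtf)) return TextDataFormat.Rtf;
        if (isAvailable(TextDataFormat.Html)) return TextDataFormat.Html;
        return TextDataFormat.UnicodeText;
    }

    // Reads the clipboard in the richest available format.
    // Must run on an STA thread, such as the WinForms UI thread.
    public static string Read(out TextDataFormat format)
    {
        format = PickRichestFormat(Clipboard.ContainsText);
        return Clipboard.GetText(format);
    }
}
```

In the question's form this could be used as follows: call ClipboardReader.Read(out format), then assign the result to richTextBox1.Rtf when format is TextDataFormat.Rtf, and to richTextBox1.Text otherwise. Assigning RTF markup to the Text property would display the raw markup instead of formatted text.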

