web scraping - C# web scraper copying text
I have a web scraper written in C# that extracts data. I want to copy text from a WebBrowser control and paste it into a Word file programmatically. When I try to extract the rich text box content using its ID and InnerText, the text contains encoded characters such as %2c.
I need the text with its formatting, but I can't find a way to get it. I have tried Encoding, HttpUtility.UrlDecode, SendKeys, and elem.InvokeMember() without success.
How can I programmatically copy and paste text from a WebBrowser control while preserving the formatting?
Here is the sample data I want to extract:
Description
The Advance Concepts Engineering team designs and develops new vehicles to meet future regulatory requirements and customer competitive requirements. The qualified candidate will be responsible for total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. Lead cross-functional team meetings working with Systems & Components, Advance Manufacturing, Service, etc. to ensure solutions are optimized for all stages of the vehicle's life.
    HtmlElement elem = wb.Document.GetElementById("ctl00_contplhdynamic_txtdescrcontenthiddentextarea");
    if (elem == null) return;
    elem.InvokeMember("click");
    //elem.InvokeMember("select all");
    //elem.InvokeMember("copy");
    SendKeys.SendWait("^a");
    SendKeys.SendWait("^c");
    Clipboard.Clear();
    elem.Focus();
    elem.InvokeMember("right click");
    elem.InvokeMember("select all");
    elem.InvokeMember("copy");
    Clipboard.SetText(elem.InnerText);
    string clipbrdtext = Clipboard.GetText();
    string data = elem.InnerText;
    richTextBox1.Text = data;
    string temp = System.Web.HttpUtility.UrlDecode(data);
    Encoding iso = Encoding.GetEncoding("windows-1252");
    Encoding utf8 = Encoding.UTF8;
    byte[] utfbytes = utf8.GetBytes(data);
    byte[] isobytes = Encoding.Convert(utf8, iso, utfbytes);
    string msg = iso.GetString(isobytes);
The text "%2c" etc. has been encoded. If you are getting the content of a web page, you should be decoding HTML, not a URL. You can use HttpUtility.HtmlDecode, or if you are using .NET 4.0 or above you can use WebUtility.HtmlDecode, which is available within the System.Net namespace.
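To see which decoder affects which kind of escaping, here is a minimal sketch using System.Net.WebUtility. The class name DecodeDemo and the sample strings are made up for illustration, and WebUtility.UrlDecode assumes .NET 4.5 or later. It is worth running your own scraped string through both calls to check which one actually changes it.

```csharp
using System;
using System.Net;

static class DecodeDemo
{
    // Percent-escapes such as %2C are URL encoding.
    public static string DecodeUrl(string s)
    {
        return WebUtility.UrlDecode(s);
    }

    // Entities such as &amp; or &lt; are HTML encoding.
    public static string DecodeHtml(string s)
    {
        return WebUtility.HtmlDecode(s);
    }

    static void Main()
    {
        // Hypothetical sample strings, not taken from the real page.
        Console.WriteLine(DecodeUrl("designs%2C develops"));        // designs, develops
        Console.WriteLine(DecodeHtml("systems &amp; components"));  // systems & components
        // HtmlDecode leaves percent-escapes untouched.
        Console.WriteLine(DecodeHtml("designs%2C develops"));       // designs%2C develops
    }
}
```

Neither call restores formatting such as bold or lists; they only turn escaped characters back into plain characters.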
You should also note that Word does not use HTML formatting, so you won't be able to paste in HTML tags and expect it to recognise them, i.e. <strong>description</strong> will not result in bold text if you type it into Word.
Edit:
It looks like you are mixing two different ways to copy text in the code you pasted: both SendKeys.SendWait("^c"); and elem.InvokeMember("copy");. I presume both of these methods work?
I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try specifying that you want formatted text by using Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should copy the string while preserving its formatting.
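A rough sketch of reading formatted text from the clipboard is shown here. The class and method names (ClipboardDemo, GetFormattedClipboardText) are hypothetical, and it assumes a Windows Forms project where the copy step has already put the selection on the clipboard.

```csharp
using System;
using System.Windows.Forms;

static class ClipboardDemo
{
    // Returns the richest text format currently on the clipboard:
    // RTF first, then HTML, then plain text as a fallback.
    public static string GetFormattedClipboardText()
    {
        if (Clipboard.ContainsText(TextDataFormat.Rtf))
            return Clipboard.GetText(TextDataFormat.Rtf);
        if (Clipboard.ContainsText(TextDataFormat.Html))
            return Clipboard.GetText(TextDataFormat.Html);
        return Clipboard.GetText();
    }

    [STAThread] // clipboard access needs a single-threaded apartment
    static void Main()
    {
        Console.WriteLine(GetFormattedClipboardText());
    }
}
```

If the result is RTF, it can be assigned to richTextBox1.Rtf instead of richTextBox1.Text so the formatting is kept. The HTML clipboard format comes with a short header (Version, StartHTML, and so on) in front of the markup, so it usually needs trimming before use.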